
dataharmonization's People

Contributors

danielzuleta, saarthakmaini, valentineherr

Forkers

saarthakmaini

dataharmonization's Issues

robust processing of dates

Dates seem to be a recurrent problem. There are mixes of formats within tables, and especially between different tables that users want to stack (different censuses). A sub-routine that handles dates in different formats seems a very useful addition.
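A minimal sketch of what such a sub-routine could look like, in base R. The function name `parse_mixed_dates` and the three candidate formats are assumptions for illustration, not part of the app. Each value is only parsed with a format whose overall shape it matches, because `as.Date()` ignores trailing characters and can otherwise return a wrong date silently. Note that `%b` depends on the locale, and that dd/mm versus mm/dd cannot be resolved value by value.

```r
# Hypothetical helper: parse a vector that mixes several date formats.
parse_mixed_dates <- function(x) {
  # Shape of the string (regex) -> format used to parse it.
  candidates <- c(
    "^[0-9]{4}-[0-9]{2}-[0-9]{2}$" = "%Y-%m-%d",
    "^[0-9]{2}/[0-9]{2}/[0-9]{4}$" = "%d/%m/%Y",
    "^[0-9]{2}[A-Za-z]{3}[0-9]{4}$" = "%d%b%Y"
  )
  x <- trimws(as.character(x))
  out <- rep(as.Date(NA), length(x))
  for (pattern in names(candidates)) {
    # Only touch values that are still unparsed and match this shape.
    hit <- is.na(out) & grepl(pattern, x)
    out[hit] <- as.Date(x[hit], format = candidates[[pattern]])
  }
  out  # values that match no shape, or are invalid dates, stay NA
}

mixed <- c("1999-12-31", "31/12/1999", "11Nov2000", "garbage", NA)
parsed <- parse_mixed_dates(mixed)
```

Values left as `NA` could be reported back to the user instead of being dropped silently.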

zombie trees

Zombie trees are trees that seem dead in some censuses and are then confirmed alive later. The dead code should be corrected retroactively, converting it into the alive code in those earlier censuses. Does the app implement this correction?
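For reference, a sketch of the rule described above, under assumptions: one status value per tree and census, rows sorted by census within each tree, and hypothetical codes "A" (alive) and "D" (dead). I do not know how (or whether) the app does this internally.

```r
# Hypothetical helper: within one tree, any "dead" record that comes before
# the last "alive" record is converted to "alive".
fix_zombies <- function(status, alive = "A", dead = "D") {
  last_alive <- max(c(0, which(status == alive)))  # 0 if never seen alive
  idx <- seq_along(status)
  status[which(idx < last_alive & status == dead)] <- alive
  status
}

census <- data.frame(
  tree_id = c(1, 1, 1, 2, 2, 2),
  census  = c(1, 2, 3, 1, 2, 3),
  status  = c("A", "D", "A", "A", "D", "D"),
  stringsAsFactors = FALSE
)
census <- census[order(census$tree_id, census$census), ]
# ave() applies the correction separately to each tree.
census$status_fixed <- ave(census$status, census$tree_id, FUN = fix_zombies)
```

Tree 1 is a zombie (dead in census 2, alive in census 3) and gets corrected; tree 2 stays dead.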

runGithub

Is there any alternative to "shiny::runGitHub"? Sometimes we cannot launch the app because of timeouts and things like that. Once the package is installed, shouldn't we be able to run the app without an internet connection, completely locally?

the online version is not being used very often in the ForestGEO workshop in Panama.

"status" is used ambiguously in step 2 of "headers and units"

There are a couple of questions that say "which or your status(es) represent a LIVE tree?" and "which or your status(es) represent a DEAD tree?"

These two refer to two different variables, but they are phrased exactly the same. I think it is better to make more explicit links. Perhaps by numbering the questions within that particular block? "Which of your [etc.]; (see question x of block xxxxx)"

BotanicalCorrection issue

if(length(pass.this.unique) != nrow(tnrs)) stop("some species did not pass through") else tnrs$Name_submitted <- pass.this.unique # this necessary when there is special characters

This line of code does not pass with the test data:

"Problem with the API: HTTP Status 400
Error in if (length(pass.this.unique) != nrow(tnrs)) stop("some species did not pass through") else tnrs$Name_submitted <- pass.this.unique : argument is of length zero"
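The second error looks like a consequence of the first: when the API call fails, `tnrs` is presumably `NULL`, `nrow(NULL)` is `NULL`, and comparing against it gives a zero-length logical that `if()` cannot handle. A possible guard, as a sketch only (`check_tnrs_result` is a hypothetical name; `tnrs` and `Name_submitted` come from the line above):

```r
# Hypothetical wrapper around the failing line: fail with a clear message
# when the service returned nothing, before comparing lengths.
check_tnrs_result <- function(tnrs, submitted) {
  if (is.null(tnrs) || !is.data.frame(tnrs)) {
    stop("the name-resolution service returned no table (e.g. HTTP 400)")
  }
  if (length(submitted) != nrow(tnrs)) {
    stop("some species did not pass through")
  }
  tnrs$Name_submitted <- submitted  # needed when there are special characters
  tnrs
}

# nrow(NULL) is NULL, so the original comparison has length zero:
length(2 != nrow(NULL))
```

This does not fix the HTTP 400 itself, it only makes the failure readable.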

many users are not familiar with how date formats are declared

It may be easier for the users to pick the date format from a dropdown menu that has examples. I mean, things like "12-31-1999" instead of "mm-dd-yyyy" etc.

At some point we talked about offering a summary of the formats that we could pick from the database, but I do not remember now having seen it in the app.

This is a vulnerability of the app -- in most cases we should be able to guess the date format, but users may declare the wrong format and make a mess.
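One way to build such a dropdown, as a sketch: format a fixed example date with each candidate format and use the result as the label. The list of formats is an assumption. A named vector like this can be passed as `choices` to `shiny::selectInput()`, which shows the names and returns the values.

```r
# Candidate formats (assumed list) and an unambiguous example date.
formats <- c("%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d", "%d/%m/%Y")
example_date <- as.Date("1999-12-31")

# Label each format with what the example date looks like in that format.
labels <- vapply(
  formats,
  function(f) format(example_date, format = f),
  character(1),
  USE.NAMES = FALSE
)
date_choices <- setNames(formats, labels)
date_choices
```

Using 31 December avoids labels where day and month could be confused.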

names of the blocks in "headers and units"

The names of the blocks in "headers and units" are not too clear.

  • "plot measurements" for what seem to be mostly data summaries at the plot level. Perhaps "summaries at the plot level" or "plot attributes"
  • tree/stem information --> if these are just tree identifiers, just call the block "tree/stem identifiers"
  • the same with identifiers for plots/sites/subplots

May add more later

stem identifiers using points are treated as numeric

Some networks use 123.1, 123.2, 123.3, etc. as the identifier for stems. This seems to be interpreted as numeric, not character. This results in a loss of information where 123.10 changes to 123.1 and we end with duplicated identifiers. These identifiers have to be treated as characters.
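A small reproduction of the problem in base R, with made-up data and a hypothetical column name `stem_id`. Declaring the column as character at read time keeps the identifiers intact.

```r
csv_text <- "stem_id,dbh\n123.1,10.5\n123.10,11.2\n"

# Default type guessing: stem_id becomes numeric and 123.10 collapses to 123.1.
guessed <- read.csv(text = csv_text)
anyDuplicated(guessed$stem_id) > 0

# Forcing the identifier column to character preserves both values;
# columns not named in colClasses are still guessed as usual.
kept <- read.csv(text = csv_text, colClasses = c(stem_id = "character"))
kept$stem_id
```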

are stacking, merging and tidying in the profile?

These three steps should be included in the profile if we want to ensure reproducibility of the whole flow. This is important, I see users behaving a little bit erratically also during these three steps.

I assume these steps are not included in the profile, because the profile is uploaded later.

Why don't we include these steps in the profile?

anticipate new column names when tidying

When tidying the data, there are new column names, for example CensusID. These column names are not easily identified by the user later. It would be good to add a warning for the user, "remember that there will be a new column name called CensusID" or something like that.

Besides, what happens if there is already a column "CensusID"? It might happen in some cases.
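For the collision case, a sketch of one possible rule (an assumption, not what the app does): keep the requested name when it is free, otherwise add a suffix, and tell the user which name was finally used. `safe_new_name` is a hypothetical helper.

```r
# Hypothetical helper: return a column name that does not clash with the
# names already present in the table.
safe_new_name <- function(existing, new) {
  tail(make.unique(c(existing, new), sep = "_"), 1)
}

safe_new_name(c("TreeID", "dbh"), "CensusID")
safe_new_name(c("TreeID", "CensusID"), "CensusID")
```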

pre-populate the keys to merge

Repeated column names in >1 table are likely to be the keys. It can be safer to pre-select them, and let the user un-select some if they do not apply. In some cases, the user will forget to check all the right keys, which generates the ".y" problem later.

We have that initial checklist, but users are frequently over-confident.
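Finding the candidates is cheap. A sketch, with a hypothetical helper `candidate_keys` and made-up tables: column names that appear in more than one table are proposed as pre-selected keys.

```r
# Hypothetical helper: names shared by at least two of the uploaded tables.
candidate_keys <- function(tables) {
  # unique() guards against a name repeated inside a single table.
  all_names <- unlist(lapply(tables, function(tb) unique(names(tb))))
  counts <- table(all_names)
  names(counts)[counts > 1]
}

tables <- list(
  trees   = data.frame(PlotID = 1, TreeID = 1, dbh = 10),
  species = data.frame(PlotID = 1, TreeID = 1, species = "Ficus maxima"),
  plots   = data.frame(PlotID = 1, area = 1)
)
keys <- candidate_keys(tables)
```

Here `PlotID` and `TreeID` would be pre-selected, and the user could un-select `TreeID` for the merge with the plot table.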

error when inputting app output

mentioned in #43, but an issue in itself:

People get errors when they load again, as input, the dataset and profile just returned by the app. This means that the process of [running the app for reformatting] and then, in a second independent round, [running the app for data corrections] is not smooth.

Letters with ¨

Letters with ¨ appear as "?" within a black rhomboid in the preview in the app. I do not know if this is a problem that is carried to the output, or just a problem in that visualization in the shiny app. It mostly affects author names.

HOM vs POM?

Very unclear what the difference is. Users are confused. It seems the app treats HOM as a number, and POM as a code (breast, base? I never saw this). But this distinction is not standard, people use these two terms interchangeably.

upload everything at the beginning

It seems more intuitive to upload everything at the start, in the first tab:

(1) number of tables --> upload your tables
(2) upload the profile of your input, if you have it
(3) upload / pick the profile that describes your desired output, if you have it

this first tab is the place to explain the general process:

your data + description of your data format + description of your desired output format = your data in the desired output format

log of warnings and errors

some users are copy-pasting the pop-up windows with info about warnings and errors. It seems useful to have a log of those, along with any other problems during corrections etc., as part of the final output.
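A sketch of how warnings and errors could be collected while a step runs, using only base R condition handling. `run_with_log` is a hypothetical wrapper; how the log is written to the final output is left open.

```r
# Hypothetical wrapper: evaluate an expression, record every warning and
# any error as text, and return both the value and the log.
run_with_log <- function(expr) {
  log <- character(0)
  value <- withCallingHandlers(
    tryCatch(expr, error = function(e) {
      log <<- c(log, paste("ERROR:", conditionMessage(e)))
      NULL  # the step failed, so there is no value
    }),
    warning = function(w) {
      log <<- c(log, paste("WARNING:", conditionMessage(w)))
      invokeRestart("muffleWarning")  # keep going after logging
    }
  )
  list(value = value, log = log)
}

step_ok  <- run_with_log({ as.numeric("12,5"); 1 + 1 })
step_bad <- run_with_log(stop("dates could not be parsed"))
```

The `log` entries of all steps could then be written to a text file next to the output table and the profile.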

list of pioneers in corrections

The datasets can have hundreds or thousands of species. It is difficult to pick from a dropdown menu.

Users do not know each individual species, unless the person running the app is the expert local botanist. Students, colleagues, postdocs, etc. have no clue.

Because of the large quantity of options, this is a step essentially impossible to reproduce in practice (unless the selection of pioneers is stored in the profile).

Using genera instead of species may help make the dropdown menu easier, and also removes some variability in the user choices.

Another alternative is to search for wood density in the global wood density database or something similar, make guesses based on species (then genus, then family, then stand), and use some pre-defined threshold or rule inside the app. This goes in the direction of not allowing the user much flexibility through the app. Users that require flexibility can still do things outside the app. The app, as a data federation tool, is used by users that do not have clear goals with the corrections, and can make rather arbitrary choices. We cannot expect [agreed criteria] + [perfect communication] between the different teams or individuals sharing data.

always remind the user to look

Always remind the user to look at all the items within each block within "headers and units", to make sure that they know what they want to do, before starting to do it. They start filling right away.

Splitting into blocks is very useful, but some of them are long enough for the users not to visualize all the items at a glance.

I do not think we need more blocks, I think the splitting into this number and size of blocks is practical.

correction for extreme growth: mention "outliers" or "error"

It can be more intuitive for some users to think in terms of outliers: "Some individuals grow too much. At what point will you consider individuals an outlier / error?"

I think "error" is what we mean, right? There is no reason to truncate extreme growths (and adjust dbhs) that are not errors.

reproducible corrections

Corrections have many options and it will be difficult (impossible) for two teams to run the same corrections.

Communication between different teams that want to aggregate data is, and will be, extremely limited. There is basically no mutual understanding of the datasets. People running the app share the output data, and not the profile, etc.

Anything that is not in the profile is a vulnerability in data federation, and for data corrections in particular. (Because stacking and merging have, in theory, objective rules, while the parameters in these corrections are subjective).

The good practice is to do merging of datasets and then using the app for corrections. But many of our warning texts go unnoticed by users. We need to enforce good practices beyond warning messages.

It may make sense to split the app in two, and add more detail in the corrections, because it is difficult to understand all those parameters on the fly.

plot, individual, stem, and almost nothing else

Use more strict terms when referring to identifiers. Avoid "tag", "tree", "trunk". Use dry descriptions of the identifiers:

"stem within individual"
"individual within plot"
"stem within the entire dataset"
"individual within the entire dataset"
"plot within the entire dataset"

or stem-within-individual, individual-within-plot, and so on.

I do not understand well the difference between "site" and "cluster of plots". Cluster has spatial connotations. It may be appropriate for a majority of cases, but perhaps "group of plots" is more general.

mixing individual-level and stem-level table

Some users want to merge datasets from stem-level tables and tree-level tables. Is the app useful for that?

For teams that measure 1 stem per tree: Is it OK to declare "stem-level" as the more detailed level of the table? I do not see how individual-level vs. stem-level can make a difference.

"add more" as an option in the tidying step

The app makes a guess, offers several options, and a couple of empty/free slots.

This results in too few options in many cases, especially when the non-tidy columns have this structure:

[whatever] and [whatever]_t2

The guessing of clusters of variables is very useful in general, but it should not constrain the number of available slots. Use N/2 slots, where N is the number of columns.

users with too many columns as input

Some users have many columns and get overwhelmed by the dropdown menus. They suggest an intermediate step, somewhere before declaring headers, to filter their tables, so they check the columns of interest only.

ghost columns

When loading, many datasets appear with 1, 2 or several extra columns that are (apparently) empty, called "V" followed by a number. These columns block stacking, so they represent a root problem that needs to be fixed in Excel, etc.

(1) can we make a general check at loading, and avoid "empty" columns without column names?

(2) can we allow the users to select, explicitly, the columns that they want to use? This is a feature that could be useful for data federation of datasets that store many variables in their tables that are irrelevant in collaborative projects
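For (1), a sketch of a check at loading, under the assumption that "empty" means every cell is NA or contains only white space. `drop_empty_columns` is a hypothetical helper and the data are made up.

```r
# Hypothetical helper: remove columns in which every cell is NA or blank.
drop_empty_columns <- function(df) {
  is_empty <- vapply(
    df,
    function(col) all(is.na(col) | trimws(as.character(col)) == ""),
    logical(1)
  )
  df[, !is_empty, drop = FALSE]
}

loaded <- data.frame(
  TreeID = 1:2,
  dbh = c(10.5, NA),
  V3 = c(NA, NA),
  V4 = c(" ", ""),
  stringsAsFactors = FALSE
)
cleaned <- drop_empty_columns(loaded)
```

A column with a single stray character somewhere would not be dropped by this rule, so the names of the removed columns should probably be shown to the user.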

add units in headers of our app default

I think it is better to add explicitly the units in the headers of our default format. It seems the default format will be used often. People forget about the possibility that their original units have changed. They keep the inertia of the original units. Everything is in metadata, and the metadata table is clearly a must. But most of our users infer variable content by looking at their data table in Excel -- consistent use of metadata tables or other forms of documentation is not as common as one would think.

long story short, having the units incorporated into the variable names can help.

typos

  • corresponf typo somewhere, instead of "correspond"

may add more later

name vs id

The distinction is not very clear, when it refers to plots, subplots, etc. I wonder how frequently people keep plot names and unique plot id's in different columns. It seems more frequent that their plot name is their unique plot id, rather abstract alphanumeric like "ABC1", "ABC2", etc.

Consider mapping into a single column called "unique-plot-id", that could take numbers or names. But I am not fully aware of the reasoning behind the current design.

concatenation of codes

there are minor problems arising from this.

some users do not have codes at all. The default option should be "It does not apply" or something like that, at least when the user did not choose any variable with codes, in step 1 of "headers and units".

I think it can be better if instead of just one question in the step 2, we present one question for each of the selected columns, or no question whatsoever if the user did not choose any column.

the user should be explicit in the punctuation points that they use. We can extract the options from the content of those variables instead of offering a pre-selected list.

the separation may be different in different columns.

some codes have the form "not.recruited", with points. People love points. This makes a mess. We want users to be able to un-select points (in general, to choose explicitly the appropriate separators between codes).
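A sketch of how the separator options could be extracted from the content of a column instead of a fixed list. `find_separators` is a hypothetical helper: it counts every punctuation or space character in the codes, so the user can then tick the ones that really separate codes.

```r
# Hypothetical helper: count punctuation/space characters found in a column.
find_separators <- function(codes) {
  chars <- unlist(strsplit(as.character(codes[!is.na(codes)]), ""))
  seps <- chars[grepl("[[:punct:][:space:]]", chars)]
  counts <- vapply(unique(seps), function(s) sum(seps == s), integer(1))
  sort(counts, decreasing = TRUE)
}

codes <- c("A;B", "C;D", "not.recruited;dead", NA)
found <- find_separators(codes)

# With ";" chosen and "." un-selected, the point stays inside the code.
# fixed = TRUE matters: "." as a regular expression would match everything.
split_codes <- strsplit("not.recruited;dead", ";", fixed = TRUE)
```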

table preview in "headers and units" step

Users look at their tables constantly when working on this tab. The previous tab had a preview. But it seems useful to have a preview in the "headers and units" tab itself.

For example, in a single column, one thing below the other:
(1) pre visualization of the plot headers and content
(2) step 1 (stuff that is on the table)
(3) step 2 (stuff that needs to be declared independently)

specific epithet

"species name (one word)" in one of the blocks of "headers and units" is not a correct use of the term.

use "specific epithet".

The species name is a Latin binomial that contains the [genus] followed by [the specific epithet].

"species" is one of the most confusing possible column names, contains almost anything. For example, some people can store "species" as "Moraceae_Ficus_maxima". We should not add to the confusion.

some specific epithets are "unknown first". Check if TNRS works with something like that.

life forms or habits

This is useful in data federation activities because some people do lianas while others don't, etc., so we want that info to be kept to the end of the process, at least in our defaults.

The list of life forms is not complete or too clear, however. Add "hemiepiphytes". "Shrub" does not mean anything in particular, they are treated as multi-stemmed individuals based on their minimum dbh. As far as I know, nobody excludes shrubs per se.

incongruent year and date

One case has "year" = "2001" and "date" = "11Nov2000". This is because the "year" is used as an identifier of the census, and is repeated across individuals regardless of the specific date. If the census spans two different calendar years, this may cause problems, depending on how we handle the dates.

This reinforces the idea that we need a robust sub-routine for dates.

It could be good to add text in "date" saying "this could vary between trees within the same plot, if they were measured on different dates", in contrast to text in "census id" saying "this should not change between trees measured in the same census, even if they were measured on different dates". This is rather obvious, but having similar texts highlighting the differences between variables can help.

using "year" as a census id seems common. Also, when storing data in the wide format, people use the year sometimes, e.g. dbh_2000, dbh_2003, dbh_2006, etc.

app dies often

the app seems to die often, or disappear (the console shows the app is running, but the window is nowhere). I do not know why this happens, but it is happening in slightly different ways in different machines.

In my case, it seems that running the app in the browser gives fewer problems. Other users say that the pop-up window gives fewer problems than the app in the browser. I have no idea, but this is one of the most annoying issues we are having.

Life/Dead status in different columns

Currently I only ask for one LifeStatus column, and the user needs to select the values in LifeStatus that mean "dead" and the ones that mean "alive". Any other value is considered as "NA".

This works well except that ForestPlots (which is following RAINFOR) has 2 different columns:
One for "alive status" (F1, and trees that are dead get a "0" there) and one for "dead status" (F2, and trees that are alive get a "1" there).

See https://rainfor.org/wp-content/uploads/sites/129/2022/06/RAINFOR_data_codes_EN.pdf

If we only ask for one column, we can capture one status properly (e.g. if F2 is selected, we can accurately say who is "alive" but we won't have an exhaustive list of all the possible codes for dead trees... thus, some dead status may be changed to "NA" in a different dataset).

I can "easily" fix that by asking 2 times for a status column. One that gives a clear picture of who is "alive" and one for the "dead" ones. Most plots may select the same column for both, but ForestPlots will be able to select F2 and F1, respectively. So they will be able to select "1" in F2 to indicate the trees that are "alive" and "0" in F1 to indicate the trees that are "dead".
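Roughly what I have in mind, as a sketch only. The function name, its arguments and the example values are hypothetical; rows that match neither set, or both, are left as NA.

```r
# Hypothetical helper: derive one life status from two user-declared
# columns (which may be the same column).
derive_life_status <- function(df, alive_col, alive_values, dead_col, dead_values) {
  is_alive <- df[[alive_col]] %in% alive_values
  is_dead  <- df[[dead_col]] %in% dead_values
  status <- rep(NA_character_, nrow(df))
  status[is_alive & !is_dead] <- "alive"
  status[is_dead & !is_alive] <- "dead"
  status  # contradictory or undeclared rows stay NA
}

# Two-column layout as described above: "1" in F2 = alive, "0" in F1 = dead.
two_cols <- data.frame(F1 = c("a", "0", "c"), F2 = c("1", "2", "3"),
                       stringsAsFactors = FALSE)
status_two <- derive_life_status(two_cols, "F2", "1", "F1", "0")

# Single-column layout: the same column is selected twice.
one_col <- data.frame(LifeStatus = c("A", "D", "X"), stringsAsFactors = FALSE)
status_one <- derive_life_status(one_col, "LifeStatus", "A", "LifeStatus", "D")
```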

@cpiponiot do you know other networks that may benefit from that?

The problem with this is that it will make the profiles that already exist obsolete, so people will need to update their profile.
But I think it is worth it...

checkboxes and "save" button for blocks at "headers and units"

This is a long section. One of the main issues is that users lose their work because they don't save the changes as they advance. (The button is at the end and they don't see it).

This is related to the fact that the app closes suddenly or the window disappears and we cannot locate it again, forcing us to re-start.

Another issue is that there are many blocks, their names are not super-easy to remember, and they lose track of the blocks that they have completed.

Is it possible to add this?:

(1) a checkbox to the side of each block, to mark the blocks as "done" or "completed"
(2) a "save changes" button just below each block, so they save their work more frequently. Alternatively, auto-saving every minute or whatever could be useful, but I imagine that would be more complicated.

Not obvious explanation in the "tidying" step

Some users list long lists of things that are repeated, instead of grouping the pairs or triplets of variable names that apply.

For example, in a database that has these variables:

dbh
dbh_time2
height
height_time2
is_alive
is_alive_time2

they list "dbh, height, is_alive" once, instead of listing

dbh, dbh_time2
height, height_time2
is_alive, is_alive_time2

I guess the text that guides the user in creating these clusters of variables is not very transparent. Perhaps some examples are helpful.
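An example like the following might help in the guidance text. It is a sketch with base R `reshape()` and made-up values, not the app's own code: each cluster is one variable measured at several times, and each cluster becomes one column in the tidy table.

```r
wide <- data.frame(
  tree_id      = 1:2,
  dbh          = c(10, 20),
  dbh_time2    = c(11, 21),
  height       = c(5, 6),
  height_time2 = c(5.5, 6.5)
)

# One cluster per variable: list(c(time 1, time 2), ...).
long <- reshape(
  wide,
  direction = "long",
  varying   = list(c("dbh", "dbh_time2"), c("height", "height_time2")),
  v.names   = c("dbh", "height"),
  timevar   = "CensusID",  # the new column that users should be warned about
  times     = 1:2,
  idvar     = "tree_id"
)
```

Listing "dbh, height" together as a single cluster would instead stack two different variables into the same column.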

do not remove trees below minimum dbh

Are trees below minimum dbh removed in any step? Don't do it. Trees could be 10.00 cm at census 1 and 9.99 cm at census 2, etc. It is ok to follow these trees regardless of how their dbh changes.

extra columns with V1, V2, V3, V4, etc.

Sometimes there are many extra columns that users identify as "empty". Why do these columns appear? Perhaps they have a blank space in some isolated cell somewhere. What is the safety check they should do to avoid this problem? Can we incorporate a criterion to decide whether a column is "empty" (empty of meaningful information)? Would that be safe to do?

guess dates format better

we may need a sub-routine for guessing the date formats better. Some users have no clue, they just look at the top first rows and make a guess that will be, in almost all cases, no better than the guess that the app can make by itself.
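A sketch of how the app could guess from the whole column rather than the first rows. `guess_date_format` is a hypothetical helper and the candidate formats are an assumption. Each format is scored by the share of values that parse and also print back identically, which rejects partial matches; this assumes zero-padded days and months.

```r
# Hypothetical helper: share of values that each candidate format explains.
guess_date_format <- function(x, formats = c("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d")) {
  x <- as.character(x)
  x <- x[!is.na(x) & nzchar(x)]
  vapply(formats, function(f) {
    parsed <- as.Date(x, format = f)
    # Round trip: the parsed date must print back as the original string.
    mean(!is.na(parsed) & format(parsed, format = f) == x)
  }, numeric(1))
}

# The first row alone is ambiguous; the rest of the column is not.
dates <- c("03/04/2000", "25/04/2000", "13/05/2001")
share <- guess_date_format(dates)
best <- names(which.max(share))
```

When two formats tie at 100%, the column is truly ambiguous and the user has to decide; otherwise the app could pre-select the best format and warn if the user picks another one.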

are spaces in table names ok?

are spaces in table names ok?
if this is a constraint, or there is any constraint in the table name, we should say so.

decimals separated by commas

A dataset is giving problems, always appearing with decimals separated by commas.

We have saved it as csv, UTF-8, and when opening it in Excel or as a text file the decimal degrees seem separated by dots, not commas. Still, decimals appear with commas in Shiny, causing most DBHs to end up as NA in the output table.

I cannot see how this could be an app problem. It seems an Excel problem, but I report just in case.
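In case it helps to reproduce, this is how the NA appears in base R and two possible ways around it. The data are made up and `to_numeric` is a hypothetical helper. Files exported with regional settings that use decimal commas typically use ";" as the field separator, which is what `read.csv2()` expects.

```r
# A decimal comma cannot be converted directly: this gives NA with a warning.
suppressWarnings(as.numeric("12,5"))

# Option 1: read the file declaring the decimal mark (sep = ";", dec = ",").
csv_text <- "tree_id;dbh\n1;12,5\n2;30,1\n"
from_file <- read.csv2(text = csv_text)

# Option 2 (hypothetical helper): convert a column that is already loaded
# as text, accepting both decimal marks.
to_numeric <- function(x) {
  as.numeric(sub(",", ".", trimws(as.character(x)), fixed = TRUE))
}
converted <- to_numeric(c("12,5", "30.1", NA))
```

Option 2 would be wrong for values that use the comma as a thousands separator, so it should only be applied when the user confirms the decimal mark.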

step 1 and step 2 in "headers and units"

There is a warning but users don't pay attention, and work on steps 1 and 2 simultaneously, in parallel. It would be better to have the step 1 at the top, and step 2 below. It will help users to work sequentially.
