Comments (15)

nevrome commented on August 24, 2024

The Art & Architecture Thesaurus and the Thesaurus of Geographic Names seem to be great resources to contextualize and maybe even correct some of the data in the databases. They are published as Linked Open Data (LOD) and can be queried via SPARQL.

from c14bazaar.

dirkseidensticker commented on August 24, 2024

With regard to phases and chronological periods, one might consider http://perio.do/ as well. The data are available as JSON.

nevrome commented on August 24, 2024

A data cleaning system with an API: OpenRefine.

dakni commented on August 24, 2024

Just a short note from the spatial people: "country" as a spatial category does not need to be in the thesaurus. There is an ISO standard for it. Currently we start from the coordinates, check which country they fall into, test whether the stated code corresponds to that country and, if not, select the problematic cases and throw appropriate warnings to the user.

When there are no coordinates, the stated code or country name can only be checked for spelling, which will be achieved with the help of the function below 👇

In general, and for typing errors in particular, you should check the fuzzyjoin package and its stringdist_join function. This might simplify the process a lot.
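A minimal sketch of such a spelling check, assuming a small hand-written reference list. It uses base R's utils::adist() as a stand-in for the fuzzyjoin stringdist_join function, so that it runs without extra packages. The helper match_country() and the reference vector are hypothetical names for illustration only.

```r
# Reference names: a tiny hypothetical excerpt, not a complete country list
reference <- c("Germany", "Austria", "Croatia", "Yugoslavia")

# Country values as they might appear in a source database
raw <- c("Germony", "Jukoslavia", "Austria", "Atlantis")

# Hypothetical helper: closest reference name within max_dist edits, else NA
match_country <- function(x, reference, max_dist = 2) {
  # matrix of edit distances: one row per input value, one column per reference
  d <- utils::adist(x, reference, ignore.case = TRUE)
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(x), best)]
  ifelse(best_dist <= max_dist, reference[best], NA_character_)
}

result <- data.frame(
  country = raw,
  country_match = match_country(raw, reference),
  stringsAsFactors = FALSE
)
result
```

Values with no reference name within the distance threshold stay NA, so they can be reported to the user instead of being silently mapped. The threshold of 2 is an arbitrary choice for this sketch.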

nevrome commented on August 24, 2024

@dakni I defined three different variables in relation to country:

  • country: country value in the source database
  • country_coord: country as determined by matching the coordinates against the world map in rworldmap/rworldxtra
  • country_thes: country as determined by manual standardization in the thesaurification process

If we go without one of these variables we're losing information. I'm not reluctant to do this as long as there is a good reason and a good method to do so.

nevrome commented on August 24, 2024

@dakni concerning the current discussion in #4

What does thesaurification mean here? It is the standardization of semantic attributes to allow for better filtering. For example, the variable material contains values like "human", "human tooth", "wild boar" or "bonetool". On a more abstract level, all of them are "bone". Thesaurification means adding a column material_thes that contains "bone" for all those terms. That requires a deliberate decision about which groups are useful.
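As a rough sketch, this kind of mapping can be expressed as a lookup in a named vector. The table below is a hypothetical excerpt and not the curated list used in c14bazAAR. The values "charcoal" and "shell" and the lab number XX-0001 are invented for illustration.

```r
# Hypothetical mapping table: source value -> simplified group
material_thesaurus <- c(
  "human"       = "bone",
  "human tooth" = "bone",
  "wild boar"   = "bone",
  "bonetool"    = "bone",
  "charcoal"    = "charcoal"
)

dates <- data.frame(
  labnr = c("HU-1234", "PU-2345", "NU-3456", "XX-0001"),
  material = c("human tooth", "wild boar", "charcoal", "shell"),
  stringsAsFactors = FALSE
)

# Look up each source value; values missing from the table become NA
dates$material_thes <- unname(material_thesaurus[dates$material])
dates
```

The original material column is kept, so no information from the source database is lost.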

For the country variable things are slightly different, because the hierarchical differences in the meaning of the individual terms are not so strong. And we can determine the country value with the coordinates. But still things are far from simple. Let's consider three dates with the following entries in the source database:

labnr    country     lon   lat
HU-1234  Jukoslavia  15.8  44.1
PU-2345  Germony     10    49
NU-3456  Österreich  13    52

Country determination by coordinates should result in:

labnr    country     lon   lat   country_coord
HU-1234  Jukoslavia  15.8  44.1  Croatia
PU-2345  Germony     10    49    Germany
NU-3456  Österreich  13    52    Germany

Thesaurification should yield:

labnr    country     lon   lat   country_thes
HU-1234  Jukoslavia  15.8  44.1  Yugoslavia
PU-2345  Germony     10    49    Germany
NU-3456  Österreich  13    52    Austria

Thesaurification and country determination by coordinates yield quite different results. We could now decide that some of this information is by definition useless. I would argue against that.

dakni commented on August 24, 2024

This example makes it clear that the data are erroneous. The mention of Austria is wrong with respect to the coordinates. This information should be given to the user, who then has to decide what happens next (since these errors might not always be errors, thinking about the Austria-Germany example and the very rough, imprecise coordinate of 13 52).
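One possible way to hand this decision to the user is a warning that lists the affected lab numbers without changing any data. The sketch below assumes the two derived columns from the example above. The function warn_country_mismatch() is a hypothetical name and not part of c14bazAAR.

```r
# The three example dates with both derived country columns
dates <- data.frame(
  labnr = c("HU-1234", "PU-2345", "NU-3456"),
  country_coord = c("Croatia", "Germany", "Germany"),
  country_thes = c("Yugoslavia", "Germany", "Austria"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: warn about disagreeing rows, return the data unchanged
warn_country_mismatch <- function(x) {
  mismatch <- !is.na(x$country_coord) & !is.na(x$country_thes) &
    x$country_coord != x$country_thes
  if (any(mismatch)) {
    warning(
      "Country from coordinates and stated country disagree for: ",
      paste(x$labnr[mismatch], collapse = ", "),
      call. = FALSE
    )
  }
  invisible(x)
}

# The warning is captured here only to show the message the user would see
msg <- tryCatch(
  warn_country_mismatch(dates),
  warning = function(w) conditionMessage(w)
)
msg
```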

I understand the first part of the thesaurification, where it makes a lot of sense to me. For me it is basically a (re-)mapping. In the country case it is different, because it is only a search for the exact language, potential typos, and a cross-check with the coordinates.

dirkseidensticker commented on August 24, 2024

Like @nevrome, I'm against removing the country thesaurus. @dakni how would your approach deal with, e.g., the ISO 3166-1 alpha-3 three-letter country codes that are used within the aDRAC dataset? There you would need an additional layer to do the translation, either against a remote resource or the thesaurus, right?
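Such a translation layer could look roughly like the sketch below. The lookup table is a hypothetical three-row excerpt with illustrative name spellings. In practice it would come from a published standard rather than being typed by hand. The code "XXX" stands for any value that is not in the table.

```r
# Hypothetical excerpt of an alpha-3 lookup table (names are illustrative)
iso_lookup <- data.frame(
  alpha_3 = c("CMR", "COD", "GAB"),
  name = c("Cameroon", "Democratic Republic of the Congo", "Gabon"),
  stringsAsFactors = FALSE
)

# Codes as they might appear in a country column
codes <- c("CMR", "GAB", "XXX", "COD")

# Translate codes to names; unknown codes become NA and can be reported
translated <- iso_lookup$name[match(codes, iso_lookup$alpha_3)]
unknown <- codes[is.na(translated)]

translated
unknown
```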

dakni commented on August 24, 2024

Salut!

The snippet below uses aDRAC to find the problematic cases based on location and the corresponding ISO code. What is not yet implemented is how to deal with the problematic cases (i.e. new column, warning, etc.). It also becomes obvious that the ISO code is not "correct", since iso3166 codes are mixed with UN ISO codes. Hence, no "hand made" thesaurus is necessary, only a comparison with published standards.

But to make my point clear: I am not against a thesaurus. We can provide a country-ISO-UN-code thesaurus. My point was just that the country "check" is different from the "content mapping" of the material column.

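## devtools::install_github("ISAAKiel/c14bazAAR")
library(c14bazAAR)
library(dplyr)     # also provides the %>% pipe
library(sf)
library(maps)
library(ISOcodes)

# download the aDRAC dataset
tmp <- get_aDRAC()
tmp

# world country polygons as an sf object (country name in column ID)
w <- st_as_sf(maps::map("world", plot = FALSE, fill = TRUE))

# UN M.49 country list with ISO alpha-3 codes; strip leading blanks from names
repUN <- ISOcodes::UN_M.49_Countries %>%
  as_tibble() %>%
  mutate(Name = gsub("^ ", "", Name))

# join each date to the country polygon it falls into, attach the ISO code of
# that country and mark the dates whose stated code matches it
tmp_sf <- tmp %>%
  filter(!is.na(lat), !is.na(lon)) %>%
  st_as_sf(coords = c("lon", "lat"),
           remove = FALSE,
           crs = 4326) %>%
  sf::st_join(y = w) %>%
  left_join(y = repUN, by = c(ID = "Name")) %>%
  mutate(check = case_when(country == ISO_Alpha_3 ~ "correct")) # %>%
  #`class<-`(c("c14_date_list", class(.)))

# everything not marked "correct" is a problematic case
problems <- tmp_sf %>%
  filter(is.na(check))

unique(problems$country)
# which of the problematic codes are known to either standard?
unique(problems$country) %in% maps::iso3166$a3
unique(problems$country) %in% ISOcodes::UN_M.49_Countries$ISO_Alpha_3

## fuzzyjoin to achieve a check of the correct writing of the site names based on coordinates
library(fuzzyjoin)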

dirkseidensticker commented on August 24, 2024

Thanks for the detailed explanation. Now I get your approach.

dakni commented on August 24, 2024

No worries. Does it make sense to you/do you think it is useful? If not we stick to your original thesaurus approach and try to figure out ways to create it using ISO standards and a comprehensive language mapping.

On the one hand I would vote for such an approach, since the necessary information would be within the package. On the other hand the spatial aspect would be missing. Implementing it, e.g. by using the maps, rworldmap or rworldxtra packages, would result in redundant information, since the ISO data are also in these packages. Providing a spatial representation of the countries as an sf object seems to be another possibility. However, this would (dramatically?) increase the size of the package.

nevrome commented on August 24, 2024

#4

It didn't make sense to me, so I just called Daniel. Here are the results of our discussion:

  1. Thesaurification for countries and materials are two highly different things. We therefore split the former thesaurify() into two functions:

    • classify_material(): Adds one new column material_thes to a c14_date_list with a simplified material key. The mapping relies on a manually curated list in c14bazAAR.
    • standardize_country_name(): Adds one new column country_thes to a c14_date_list with a standardized country key. The mapping relies on good automatic data cleaning and fuzzy matching with itself and external data sources (R Package(s) -> please ask @dakni).
  2. The Spatial Quality Estimation is actually a three step endeavor and will be separated accordingly into three functions:

    • determine_country_by_coordinates(): Adds one new column country_coords to a c14_date_list with a standardized country key. The mapping relies on polygon data from external sources (R Package(s) -> please ask @dakni)
    • estimate_spatial_info_quality(): Adds one new column spatial_quality_estimation to a c14_date_list with an estimation of the spatial quality based on comparison of country_coords and country_thes. The column contains the values wrong coords, no coords, doubtful coords, possibly correct.
    • estimate_coordinate_precision(): Adds one column coordinate_precision to a c14_date_list with the info how precise the coordinate information is (spatial deviation).
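The comparison step of the second point could be sketched as follows. The rules that assign the four values are assumptions made for this example, since the thread does not define them: "no coords" for missing coordinates, "doubtful coords" when one of the two country values is missing, "wrong coords" when both are present and differ, and "possibly correct" otherwise. The function name estimate_quality_sketch() and the last two example rows are hypothetical.

```r
dates <- data.frame(
  labnr = c("PU-2345", "NU-3456", "XX-0001", "XX-0002"),
  lon = c(10, 13, NA, 0),
  lat = c(49, 52, NA, 0),
  country_coords = c("Germany", "Germany", NA, NA),
  country_thes = c("Germany", "Austria", "Croatia", "Gabon"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch of the comparison; the decision rules are assumptions
estimate_quality_sketch <- function(x) {
  quality <- rep("possibly correct", nrow(x))
  has_coords <- !is.na(x$lat) & !is.na(x$lon)
  comparable <- !is.na(x$country_coords) & !is.na(x$country_thes)
  # coordinates exist, but there is nothing to compare them against
  quality[has_coords & !comparable] <- "doubtful coords"
  # both country values exist and contradict each other
  differs <- has_coords & comparable & x$country_coords != x$country_thes
  quality[differs] <- "wrong coords"
  quality[!has_coords] <- "no coords"
  x$spatial_quality_estimation <- quality
  x
}

res <- estimate_quality_sketch(dates)
res[, c("labnr", "spatial_quality_estimation")]
```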

For standardize_country_name() and determine_country_by_coordinates() we need a reference list of different country names and codes as well as polygon definitions of country borders. @dakni knows some R packages that already contain this information in a well formatted form.

The function and variable names defined here are not very nice yet. Please suggest better names.

Edit 26.1.18: Removed references to @dakni's package.

nevrome commented on August 24, 2024

As @nmueller18 suggested, classify_material() should have an option to show which material values in a c14_date_list are not yet covered by the mapping reference list. This will facilitate the list curation.
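The core of such an option is a set difference between the values in the data and the values already in the reference list. A minimal sketch with invented example values, not the actual c14bazAAR implementation:

```r
# Values already covered by the (hypothetical) mapping reference list
known_materials <- c("human", "human tooth", "wild boar", "bonetool")

# Material values found in a date list, including duplicates and a missing value
materials <- c("human tooth", "charcoal", "wild boar", "shell", "charcoal", NA)

# Distinct, non-missing values that the reference list does not cover yet
unmapped <- sort(setdiff(unique(materials[!is.na(materials)]), known_materials))
unmapped
```

The sorted result can be handed to whoever curates the list.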

nevrome commented on August 24, 2024

@preiaen @dakni @ctietze91 @kschmuetz @whamer
Please see the comment above for reference. This is your ToDo list for c14bazAAR.

nevrome commented on August 24, 2024

I consider this mostly solved by #24, #25, #29 and #30
