idas-durham / june
June is a framework for agent based modelling in an epidemiological and geographical context.
License: GNU General Public License v3.0
I am having some trouble setting up tests in which I have control over some of the properties. For example, I want to test what happens when I send a person to hospital depending on their health_information.tag value. However, I cannot simply set that value to what I'd like; is there a simple solution for this?
Right now the files reformat.py and inputs.py are complete Frankensteins. I'd like to refactor them following this logic:
I will change the data folder structure. There will be two subfolders: data/raw/ and data/processed/ . They will be mirror images of each other. All the data directly downloaded will go into data/raw/ ; after we clean it up and process it, it will move to data/processed/ . Bear in mind that data/raw/ can be as specific to our UK scenario as we want it to be; however, when we process data we need to think about simplifying inputs and generalising to other countries.
The scripts used to clean data from raw to processed will be stored at covid/reformat_scripts/ , whose structure will also be a mirror image of data/raw/ .
inputs.py will follow the same logic as world.py . The init will be mostly empty, and each group will have a function like get_inputs_households or get_inputs_companies that will read in all the necessary datasets from data/processed/ . When world initialises the different active groups, it will first call the function needed to get the data for that particular group using inputs.py .
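A rough sketch of what that could look like (the Inputs class shape, the get_inputs_* names and the directory layout here are illustrative, not the actual API):

```python
import os

# Hypothetical sketch of the proposed inputs.py structure: the constructor
# stays mostly empty and each group has its own reader over data/processed/.
class Inputs:
    def __init__(self, processed_dir=os.path.join("data", "processed")):
        self.processed_dir = processed_dir

    def get_inputs_households(self):
        # Read only the datasets the household distributor needs.
        return {"source_dir": os.path.join(self.processed_dir, "households")}

    def get_inputs_companies(self):
        return {"source_dir": os.path.join(self.processed_dir, "companies")}
```

World would then call the relevant get_inputs_* function only when it initialises that group, keeping IO lazy and per-group.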
Currently, we initialize people with an age taken from a uniform distribution inside their age bin given by the NOMIS dataset. This should be improved using more fine-grained data, especially for the last population bin, where the distribution is likely to have a sharp negative slope.
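For example, assuming a hypothetical single-year-of-age table were available, the sampling could look like:

```python
import random

def sample_age(age_counts, rng=random):
    """Draw a single-year age weighted by fine-grained population counts.

    age_counts: dict mapping age -> number of people (a hypothetical input,
    e.g. a single-year-of-age census table) instead of a wide age bin.
    """
    ages = list(age_counts)
    weights = list(age_counts.values())
    return rng.choices(ages, weights=weights, k=1)[0]
```

This would naturally capture the negative slope in the last bin, since the per-year counts taper off rather than being flat.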
import csv
from typing import Dict

class CommuteGenerator:
def __init__(
self,
regional_generators: Dict[str, RegionalGenerator]
):
"""
Generate a mode of transport that a person uses in their commute.
Modes of transport are chosen randomly, weighted by the numbers taken
from census data for each given Output area.
Parameters
----------
regional_generators
A dictionary mapping Geography msoareas to objects that randomly
generate modes of transport
"""
self.regional_generators = regional_generators
def regional_gen_from_msoarea(self, msoarea: str) -> RegionalGenerator:
"""
Get a regional generator for an Output Area identified
by its output msoarea, e.g. E00062207
Parameters
----------
msoarea
A code for an msoarea
Returns
-------
An object that weighted-randomly selects modes of transport for the region.
"""
return self.regional_generators[
msoarea
]
@classmethod
def from_file(
cls,
filename: str,
config_filename: str = default_config_filename
) -> "CommuteGenerator":
"""
Parse configuration describing each included mode of transport
along with census data describing the weightings for modes of
transport in each output area.
Parameters
----------
filename
The path to the commute.csv file.
This contains data on the number of people using each mode
of transport.
config_filename
The path to the commute.yaml file
Returns
-------
An object used to generate commutes
"""
regional_generators = dict()
with open(filename) as f:
reader = csv.reader(f)
headers = next(reader)
msoarea_column = headers.index("geography code")
modes_of_transport = ModeOfTransport.load_from_file(
config_filename
)
for row in reader:
weighted_modes = list()
for mode in modes_of_transport:
weighted_modes.append((
int(row[
mode.index(headers)
]),
mode
))
msoarea = row[msoarea_column]
regional_generators[msoarea] = RegionalGenerator(
msoarea=msoarea,
weighted_modes=weighted_modes
)
return CommuteGenerator(
regional_generators
)
I'm sensing a theme with my git issues.
Feels to me like this is another example of something that doesn't need to be a class, but can just be stored as a list of dictionaries wherever it is used? Again, we'd need to move the config setup somewhere else.
Will tackle this once we have test coverage over the whole code.
parameters.py seems to load configuration every time an object is instantiated. This causes an excessive number of IO operations which is making things run very slowly.
As part of a PR I've added a caching mechanism which shaves 10 seconds off the fullrun test.
It would be better if configuration were loaded and then passed to objects to instantiate them. That way, a dictionary loaded from YAML can be used to instantiate a class in a fairly concise way:
cls(**dict)
In this case the dictionary keys must at least include every required argument and must not include any arguments that don't exist. I'd argue this is safer, cleaner and more concise than looping through the keys and values of a dictionary and calling setattr.
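A minimal illustration (the Transmission class and its fields here are made up for the example; in practice the dict would come from yaml.safe_load):

```python
# Sketch of instantiating a class directly from a config dictionary.
# The class and parameter names below are illustrative only.
class Transmission:
    def __init__(self, probability, recovery_rate=0.1):
        self.probability = probability
        self.recovery_rate = recovery_rate

# In real code this dict would be the result of yaml.safe_load(f).
config = {"probability": 0.3, "recovery_rate": 0.2}
transmission = Transmission(**config)
```

A misspelled or missing key fails immediately with a TypeError at construction, instead of silently creating a stray attribute via setattr.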
import yaml
from typing import List

class ModeOfTransport:
__all = dict()
def __new__(cls, description):
if description not in ModeOfTransport.__all:
ModeOfTransport.__all[
description
] = super().__new__(cls)
return ModeOfTransport.__all[
description
]
def __init__(
self,
description
):
self.description = description
def index(self, headers: List[str]) -> int:
"""
Determine the column index of this mode of transport.
The first header that contains this mode of transport's description
is counted.
Parameters
----------
headers
A list of headers from a CSV file.
Returns
-------
The column index corresponding to this mode of transport.
Raises
------
An assertion error if no such header is found.
"""
for i, header in enumerate(headers):
if self.description in header:
return i
raise AssertionError(
f"{self} not found in headers {headers}"
)
def __eq__(self, other):
if isinstance(other, str):
return self.description == other
if isinstance(other, ModeOfTransport):
return self.description == other.description
return super().__eq__(other)
def __hash__(self):
return hash(self.description)
def __str__(self):
return self.description
def __repr__(self):
return f"<{self.__class__.__name__} {self}>"
@classmethod
def load_from_file(
cls,
config_filename=default_config_filename
) -> List["ModeOfTransport"]:
"""
Load all of the modes of transport from commute.yaml.
Modes of transport are globally unique. That is, even if the function
is called twice identical mode of transport objects are returned.
Parameters
----------
config_filename
The path to the mode of transport yaml configuration
Returns
-------
A list of modes of transport
"""
with open(config_filename) as f:
configs = yaml.load(f, Loader=yaml.FullLoader)
return [
ModeOfTransport(
config["description"]
)
for config in configs
]
Is there any reason that this needs to be a class as opposed to a string? Looking within the commute module it seems a bit overkill to make this a class.
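A sketch of the class-free alternative, assuming modes only need a label and a census weight (the labels and weights below are made up, not real census figures):

```python
import random

# Modes of transport as plain strings paired with weights, replacing both
# the ModeOfTransport class and the per-region generator object.
weighted_modes = [
    (52, "Driving a car or van"),
    (17, "On foot"),
    (9, "Bus, minibus or coach"),
]

def sample_mode(rng=random):
    # Weighted random choice over the plain-string modes.
    weights = [weight for weight, _ in weighted_modes]
    modes = [mode for _, mode in weighted_modes]
    return rng.choices(modes, weights=weights, k=1)[0]
```

The flyweight behaviour of __new__ would come for free, since equal strings already compare and hash identically.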
Once I'm more familiar with the code-base I'll have a go at sorting this.
There are some cases that might slip through the if-statements in this routine:
https://github.com/JosephPB/covidmodelling/blob/35e6f0cb73863ecce28c84aca82d2b42d19a44f3/covid/groups/companies/company_distributor.py#L40
E.g. all companies in an industry sector can already be full in a certain MSOArea, and therefore an employable person won't be placed into a company. In that case we need to place them thoughtfully into another company, or we could think about leaving them unemployed if that mirrors the Census data.
Currently tests are failing all the time. Ideally nothing should be merged into master without all tests passing - this means people can rely on master being a working version of the project.
One way to prevent failing tests on master is to add branch protection rules where all checks must pass before merging is allowed. This way pull requests with broken tests cannot be merged until they are fixed.
In this line of the code:
https://github.com/JosephPB/covidmodelling/blob/35e6f0cb73863ecce28c84aca82d2b42d19a44f3/covid/groups/people/person_distributor.py#L298
we currently bind people to work inside the simulated region. Thus we overwrite their real-life MSOArea of work so that we can use them in companies within the simulation.
Can you think of a fair argument to change this and implement it?
Hi,
I noticed the list comprehension in distribute_non_mandatory_kids_to_school consumes about 12% of the run time when running over the NorthEast dataset:

n_pupils_age = len([pupil.age for pupil in school.people if pupil.age == person.age])
I wonder if the counting could be accelerated by adding a dictionary attribute to School, such as:

```python
class School(Group):
    ...
    self.age_min = age_min
    self.age_max = age_max
    self.age_structure = {a: 0 for a in range(age_min, age_max + 1)}
```

and then reading from it and incrementing it in distribute_non_mandatory_kids_to_school as such:
such:
```python
...
for i in range(0, self.MAX_SCHOOLS):  # look for a non-full school
    school = self.closest_schools_by_age[person.age][i]
    # check number of students in that age group
    n_pupils_age = school.age_structure[person.age]
    if school.n_pupils >= school.n_pupils_max or n_pupils_age >= (
        school.n_pupils_max / (school.age_max - school.age_min)
    ):
        schools_full += 1
    else:
        break
if schools_full == self.MAX_SCHOOLS:  # all schools are full
    continue
else:  # just keep the school saved in the previous for loop
    pass
school.people.append(person)
school.age_structure[person.age] += 1
```
Shall we include an option to create more verbose output to debug world creation?
Person sex is currently 0 for a man and 1 for a woman, owing to the original dataset. When this data is parsed it ought to be converted to something human-readable to reduce the chance of mistakes.
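One possible conversion at parse time, sketched with a standard-library Enum (the Sex name is a suggestion, not existing code):

```python
from enum import Enum

class Sex(Enum):
    # Values match the original dataset's 0/1 encoding.
    MALE = 0
    FEMALE = 1

def parse_sex(raw):
    # Convert the raw 0/1 (possibly a string) to a readable value once,
    # at parse time, so the rest of the code never sees bare integers.
    return Sex(int(raw))
```

Downstream comparisons like `person.sex is Sex.FEMALE` are then self-documenting, and an out-of-range value raises immediately.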
Especially the classes that are now used for infection (transmission, symptoms, infection and interaction).
The function create_two_parent_household should take into account the age of the parents, and match them with a certain Gaussian width, for example.
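A sketch of such a Gaussian match (the width and age limits are assumed placeholder values, not fitted to data):

```python
import random

def match_partner_age(parent_age, sigma=3.0, age_min=18, age_max=99, rng=random):
    """Draw the second parent's age around the first parent's age.

    Sketch only: a Gaussian of width sigma (an assumed value), clipped to
    a plausible adult age range.
    """
    age = int(round(rng.gauss(parent_age, sigma)))
    return max(age_min, min(age_max, age))
```

The clipping keeps the draw inside the household's valid adult range even for extreme Gaussian tails.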
The refactor of Symptoms / Transmission assumes the time is passed into their functions, and the Infection class now tracks when the infection begins (it stores a start time) allowing it to pass delta_t to the symptoms / transmission methods.
We need to pass the actual time in the world to Infection so the methods below can create new infections correctly.
def infect_person_at_time(self, person, time):
"""Infects someone by initializing an infeciton object with the same type
and parameters as the carrier's infection class.
Arguments:
person (Person) has to be an instance of the Person class.
"""
infection = self.new_infection_at_time(time=time)
person.health_information.set_infection(infection=infection)
def update_at_time(self, time):
if self.last_time_updated <= time:
delta_time = time - self.start_time
self.last_time_updated = time
self.transmission.update_probability_from_delta_time(delta_time=delta_time)
self.symptoms.update_severity_from_delta_time(delta_time=delta_time)
self.infection_probability = self.transmission.probability
From what I can tell, this requires us to simply pass the timer's time down from the update_status_lists() methods that are called in world. I'll implement this behaviour tomorrow if need be, alongside attempting to simplify the Interaction class.
If we set 3 timesteps of 8 hours in the config file we get the following error
File "time.py", line 46, in get_shifts_duration
return self.time_config['step_duration'][self.type_day][self.shift+1]/24.
KeyError: 4
I think it's still the floating-point precision problem, i.e. int(1.9999999999999998) == 1.
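If that's the cause, rounding before truncating would fix it; a sketch with a hypothetical helper (not the actual time.py API):

```python
# int() truncates towards zero, so a float that should be exactly 2 but
# accumulated as 1.9999999999999998 becomes 1. Rounding first recovers the
# intended integer index. shift_index is an illustrative name.
def shift_index(elapsed_hours, step_hours):
    return int(round(elapsed_hours / step_hours))
```

This only applies where the true value is known to be an integer; for arbitrary fractions, rounding would change behaviour.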
The way that we're updating groups is slow because it has to do this list comprehension every time we call this function. It's much slower than the way we were doing it before; for example, running in just box mode for 10000 people takes hours rather than minutes.
Maybe it would be better to store this list somewhere rather than calculating it every time?
In households, contact matrices should reflect the structure of a household. If the parents are older, they should still have the same number of contacts with their children.
We need to decompose groups into subgroups that represent different functions in the group and use this in our interaction models.
There is data on different methods of commuting that can be assigned to people in person_distributor.py and added to each person as an attribute.
An example of how the data on commuting is read in is here, but can be edited:
The ParameterInitializer is used to set up the parameters of model components (e.g. Transmission):
class Transmission(ParameterInitializer):
def __init__(self, timer, user_parameters, required_parameters):
super().__init__("transmission", user_parameters, required_parameters)
self.timer = timer
if timer != None:
self.infection_start_time = self.timer.now
self.last_time_updated = self.timer.now # for testing
self.probability = 0.0
The parameters themselves are loaded via yaml config files. It would be better if this class were instantiated with the parameters as input to its constructor, e.g.:
class Transmission:
def __init__(self, timer, transmission_probability=0.1):
self.timer = timer
self.transmission_probability = transmission_probability
if timer != None:
self.infection_start_time = self.timer.now
self.probability = 0.0
Assuming the transmission probability is treated as a constant, there seems no obvious reason to me why it needs to be loaded via yaml configs and the ParameterInitializer. This just makes the class difficult to change and test, and dependent on unnecessary config files.
The main purpose of the ParameterInitializer seems to be setting up the class with the input parameters (e.g. transmission_probability) not as constant values (e.g. floats) but instead as parameters defined by distributions (Uniform, Gaussian).
Am I correct in thinking that this setup of the parameters is so that they can be initialised with 'random values' drawn via the priors for model-fitting (e.g. using a non-linear search)? Put another way, once these parameters have physical values (i.e. floats drawn via the distributions), those values are then used for a full simulation (e.g. creating a world, running the simulation, tracking how the infection spreads, etc.)?
If this is the case I can simplify all the model classes (transmission, symptoms, interaction) by interfacing the code with a probabilistic programming language (I will explain this in more detail in a telecon).
So, I guess my question is are these things distributions for a different reason than the one I described above? Or are they set up as distributions for some aspect of the world simulation that I'm missing?
At the moment there are parts of the code that use an .id attribute to label an object and some that use .name . We should use the same convention throughout and thus decide whether we think "the shorter the better" or otherwise.
In covid/groups/areas/area.py
the commute_generator calls for_code
, but that function seems to not have been implemented. I have commented out everything relating to commuting.
Comment here with a suggestion for the code's name.
RULES:
Vote by giving a thumbs up.
Deadline: Monday 4th of May, 11 am UK time.
Reward: Two bottles of some of the finest spirits, courtesy of @FrankKrauss and @rgbower .
If I run all the tests together then test_fullrun passes. If I run only this test it fails.
The reason it fails is that it tries to assign a person to hospital but it can't because hospitals don't get created unless box_mode=True.
I assume hospitals should be created regardless of mode?
The inconsistency in whether people are assigned to hospital or not is likely to be due to some file or configuration that is loaded using a path relative to the current working directory. This should be made relative to a file such that tests have the same behaviour regardless of where they are run from.
CollectiveInteraction
InfectionSelector
are both missing, so tests aren't passing. It looks like CollectiveInteraction may have been renamed, since there is an InteractionCollective class, although that is missing some methods called in the tests.
We can see how the R0 in our logger when running our model in box mode compares to the R0 expected for an SIR model. Given that we can do a one-to-one comparison between our box and the SIR model, these should agree.
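For reference, the SIR model has R0 = beta / gamma; a minimal forward-Euler integration (with illustrative parameter values, not ones fitted to our model) gives the curve to compare the box-mode logger against:

```python
# Minimal forward-Euler SIR integration for the box-mode comparison.
# beta and gamma are illustrative placeholder values.
def run_sir(beta, gamma, s0, i0, removed0, dt=0.01, steps=10000):
    s, i, r = float(s0), float(i0), float(removed0)
    n = s + i + r
    for _ in range(steps):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
    return s, i, r

beta, gamma = 0.3, 0.1
basic_reproduction_number = beta / gamma  # R0 = 3.0 for these values
```

Matching beta and gamma to the box-mode interaction parameters, the logger's measured R0 should converge to beta / gamma.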
Check qualitatively how the disease spreads in households only, school only etc... for different interactions and different parameters
Assign people tagged as healthcare workers to hospitals following flow data
Allocate workers with the teacher tag to schools according to flow data
This issue is mostly independent of the rest of the code. It involves understanding contact matrices and extracting them from the BBC pandemic survey paper (https://www.medrxiv.org/content/10.1101/2020.02.16.20023754v2); they can be found in the Supplementary material section. Very importantly, the survey does not include children younger than 14 as participants, so we need to somehow extrapolate the rest of the matrix. Potentially it could be combined with POLYMOD, as they do in the BBC pandemic paper.
The following line shows that we skip adding 18-19 year olds to schools if the schools are already over their capacity.
https://github.com/JosephPB/covidmodelling/blob/6d867421001c342cf29bb2d42dbcba6c31f6b0bd/covid/groups/schools/school_distributor.py#L66
This means they would have nothing to do and would stay at home all day, every day. So let's try to give them work by including them as done here:
https://github.com/JosephPB/covidmodelling/blob/6d867421001c342cf29bb2d42dbcba6c31f6b0bd/covid/groups/people/person_distributor.py#L271
and in the functions _assign_industry_specific
and _assign_industry
in the same file further down.
This means we need to include these functions in the school_distributor.py file to change the person attributes.
Here:
https://josephpb.github.io/covidmodelling/code_structure.html
It stipulates we use setters and getters because of porting to C++. Does this help with some kind of automated porting?
If not I'd be inclined to adopt a more pythonic syntax.
You can use an underscore prefix to indicate privacy:
self._attribute = value
Properties can then be used to make the attribute public but read-only:
@property
def attribute(self):
return self._attribute
Finally, to make it fully gettable and settable we can use a property setter:
@attribute.setter
def attribute(self, attribute):
self._attribute = attribute
I've changed the python in world.py to replace the non-existent config_companies.yaml (already raised in an issue) with config_example.yaml. The code crashes:
frank@frank-XPS-15-9570:~/covidmodelling/covid$ python3 world.py
Initializing world...
Reading inputs...
Traceback (most recent call last):
File "world.py", line 388, in
world = World(config_file=os.path.join("..", "configs", "config_example.yaml"))
File "world.py", line 33, in init
self.inputs = Inputs(zone=self.config["world"]["zone"])
File "/home/frank/covidmodelling/covid/inputs.py", line 60, in init
np.unique(self.oa2msoa_df["MSOA11CD"].values)
File "/home/frank/covidmodelling/covid/inputs.py", line 150, in read_companysize_census
header=0,
File "/home/frank/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/frank/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/frank/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in init
self._make_engine(self.engine)
File "/home/frank/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/frank/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.cinit
File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File /home/frank/covidmodelling/covid/../data/census_data/middle_output_area/EnglandWales/companysize_msoa11cd_2019.csv does not exist: '/home/frank/covidmodelling/covid/../data/census_data/middle_output_area/EnglandWales/companysize_msoa11cd_2019.csv'
globals() is used so a class can be instantiated by name. This is bad practice because it means removing an apparently unused import can break the code. It also violates the Python philosophy of making behaviour explicit.
It would be better to define or generate a dictionary explicitly mapping class names to classes.
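A sketch of such an explicit registry (the class names here are placeholders, not the real infection classes):

```python
# Explicit name-to-class mapping replacing globals() lookups.
# These placeholder classes stand in for the real symptoms/infection classes.
class SymptomsConstant:
    pass

class SymptomsGaussian:
    pass

SYMPTOMS_CLASSES = {
    "SymptomsConstant": SymptomsConstant,
    "SymptomsGaussian": SymptomsGaussian,
}

def symptoms_class_from_name(name):
    # Fail loudly with a clear message instead of a bare KeyError/NameError.
    try:
        return SYMPTOMS_CLASSES[name]
    except KeyError:
        raise ValueError(f"Unknown symptoms class: {name}")
```

With this, the classes are referenced directly in the module, so linters and IDEs see the imports as used, and the set of instantiable classes is visible in one place.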
Households currently only have certain compositions, e.g. single parent, two parent, student or elderly couple. This means that inter-generational mixing does not occur over more than one generation, and that is only in cases where children live with their parents.
It would be good to create another 'generic' household function that looks at the data distributions and matches people across 2 generations (children living with parents and some grandparents or sub-sets thereof).
The starting point for this would be to look at the household data currently used and how such distributions could be drawn.
Distributing households can be found here:
Add unit tests to check that people are rightly moved to hospitals, and de-activated from other groups
Use household composition dataset (Nomis QS113UK) to define a "minimal" configuration of households.
Use the remaining people to populate households in which they could potentially live. This will allow multi-generational households.
Match the age of couples given https://en.wikipedia.org/wiki/Age_disparity_in_sexual_relationships , which is almost identical to the UK distribution and can be used more generally for other countries.
Match the age of mother and child given by https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/birthsbyparentscharacteristics . The same dataset also allows for matching siblings' ages.
Create care homes using https://www.nomisweb.co.uk/census/2011/qs421uk . First populate them with the oldest people in that output area.
Use Nomis tables QS406EW (household size) and QS110EW (household composition per age) as a test of our algorithm.
Test both the collective (as in Imperial's and our own) and contact matrix interaction in the box setting. If possible, compare with an SIR model fit. Especially check that the R0 in the logger has a correspondence with the SIR parameters.
So, to get rid of the data.zip annoyances, we should set up git lfs, as we did last time.
The pricing seems to be $5 for every 50 GB of bandwidth we use per month.
Currently, it is hard to specify exactly which geography you want in the code, and you always need to initialize a world. As a first step, one should be able to initialize just the "geography" part of the code like this:
area1 = Area.from_name("E0000XXX")
area2 = Area.from_name("E1231321")
area3 = Area(n_residents=50, coordinates=....)
areas = Areas([area1, area2, area3 ])
geography = Geography(areas)
world = World(geography = geography)
Of course it would also be possible to do
geography = Geography.from_region("NorthEast")
which would initialize all the output areas and super areas in the north east.
I will start working on this as it is needed to make other things work and testing feasible.
If people are to be hospitalised, move them to the nearest hospital with capacity in their region.
Adapt the code to people possibly staying at home if they have symptoms (perhaps with a certain probability configured in the config file?), and send them to hospital if necessary (so that they are not active inside the household)
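A sketch of that decision, assuming a hypothetical stay-at-home probability in the config (all names here are made up):

```python
import random

def place_symptomatic(must_hospitalise, stay_home_prob=0.8, rng=random):
    """Decide where a symptomatic person goes.

    stay_home_prob is a hypothetical config value; severity handling is
    simplified to a single must_hospitalise flag for this sketch.
    """
    if must_hospitalise:
        return "hospital"  # de-activate them from household and other groups
    if rng.random() < stay_home_prob:
        return "home"      # stays home, not active in other groups
    return "active"        # mild enough to carry on as normal
```

The returned tag could then drive which groups the person is removed from for the timestep.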