flowsa's Introduction

flowsa

flowsa is a data processing library attributing the flows of resources (environmental, monetary, and human), wastes, emissions, and losses to sectors, typically NAICS codes. flowsa aggregates, combines, and allocates data from a variety of sources. The sources can be found in the GitHub wiki under "Flow-By-Activity Datasets".

flowsa helps support USEEIO as part of the USEEIO modeling framework. The USEEIO models estimate potential impacts of goods and services in the US economy. The Flow-By-Sector datasets created in FLOWSA are the environmental inputs to useeior.

Usage

Flow-By-Activity (FBA) Datasets

Flow-By-Activity datasets are formatted tables from a variety of sources. They are largely unchanged from the original data source, except for formatting. A list of available FBA datasets can be found in the Wiki.

import flowsa
# Return list of all available FBA datasets, including years
flowsa.seeAvailableFlowByModels('FBA')
# Generate and return pandas dataframe for 2014 Energy Information Administration (EIA)
# Manufacturing Energy Consumption Survey (MECS) land use
fba = flowsa.getFlowByActivity(datasource="EIA_MECS_Land", year=2014)

Flow-By-Sector (FBS) Datasets

Flow-By-Sector datasets are tables of environmental and other data attributed to sectors. A list of available FBS datasets can be found in the Wiki.

import flowsa
# Return list of all available FBS datasets
flowsa.seeAvailableFlowByModels('FBS')
# Generate and return pandas dataframe for national water withdrawals attributed to
# 6-digit sectors. Download all required FBA datasets from Data Commons.
fbs = flowsa.getFlowBySector('Water_national_2015_m1', download_FBAs_if_missing=True)

Examples

Additional example code can be found in the examples folder.

Installation

pip install git+https://github.com/USEPA/flowsa.git@vX.X.X#egg=flowsa

where vX.X.X can be replaced with the version you wish to install under Releases.

Additional Information on Installation, Examples, Detailed Documentation

For more information on flowsa see the wiki.

Accessing datasets output by FLOWSA

FBA and FBS datasets can be accessed on EPA's Data Commons without running the Python code.

Disclaimer

The United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.

flowsa's People

Contributors

a-w-beck, andychase, bl-young, catherinebirney, cchiq, davidemeyer, dyoung11, ealonso-mfa, ericmbell1, jacobgqc, jbousquin, jchou18, matthewlchambers, melissagqc, moli7, rwashing523, wesingwersen, ysrivas08

flowsa's Issues

esupy 'dependency conflict'

Howdy,
I'm receiving the following error when following the installation instructions.

Collecting esupy@ git+git://github.com/USEPA/esupy@v0.1.7#egg=esupy
  Cloning git://github.com/USEPA/esupy (to revision v0.1.7) to /tmp/pip-install-v0p7lwqy/esupy_0e824dc63ad04367b1ae3e24b3bdde5d
  Running command git clone -q git://github.com/USEPA/esupy /tmp/pip-install-v0p7lwqy/esupy_0e824dc63ad04367b1ae3e24b3bdde5d
  Running command git checkout -q c04efa5aefc82a317776a2b2b1b20fae01b5fce7
INFO: pip is looking at multiple versions of fedelemflowlist to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of esupy to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of flowsa to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install flowsa and flowsa==0.2.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    flowsa 0.2.1 depends on esupy 0.1.7 (from git+https://github.com/USEPA/esupy@v0.1.7#egg=esupy)
    stewi 0.9.9 depends on esupy 0.1.7 (from git+git://github.com/USEPA/esupy@v0.1.7#egg=esupy)

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

steps to recreate

$ mkdir flowsa-test
$ python3 -m venv flowsa-venv
$ source flowsa-venv/bin/activate
$ pip install git+https://github.com/USEPA/flowsa

pip version 21.0.1
python version 3.9.6
Following the installation instructions outside of a virtual environment was also unsuccessful.

I was able to install a fork successfully by changing the 'git' to 'https' in setup.py, but then I ran into more import problems and wasn't sure if I was causing more problems than I was fixing.

Need warning when attempting to obtain data for unavailable year

fba = flowsa.getFlowByActivity(datasource="EIA_MECS_Energy", year=2017)
returns a bad zip file error because 2017 doesn't exist for EIA_MECS

Perhaps flowbyactivity could check the requested year against the list of years in the FBA yaml. This would also help prevent errors when new data are released (e.g. 2018 for EIA_MECS) but not yet tested.
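
A minimal sketch of what such a check might look like. It assumes the FBA yaml has already been parsed into a dict that holds a `years` list; the function name `check_year_available`, the dict layout, and the example years are hypothetical and not taken from flowsa.

```python
def check_year_available(source_name, config, year):
    """Raise a clear error if `year` is not listed for `source_name`.

    `config` stands in for a parsed FBA yaml; the 'years' key is an
    assumption made for this sketch.
    """
    available = sorted(int(y) for y in config.get("years", []))
    if int(year) not in available:
        raise ValueError(
            f"{year} is not available for {source_name}; "
            f"available years: {available}"
        )
    return int(year)


# Hypothetical config, for illustration only
example_config = {"years": [2010, 2014]}
check_year_available("EIA_MECS_Energy", example_config, 2014)
```

Called before any download is attempted, a check like this would replace the bad zip file error with a message that names the years that are actually available.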

Underutilization of `.yaml`s

Raising this as an issue now for awareness and discussion. I think we are currently hardcoding too many things in the data source script .py files and not making enough use of the .yaml files. For example, information on how to modify the NEI data url for different years, or information on how columns should be renamed (again, I'm thinking about the NEI data source in particular at the moment) and how the renaming depends on the year, could be included as dicts in the EPA_NEI.yaml files.

One advantage of offloading this information to the .yaml files is that it should simplify the process of adding new data years (e.g. the 2020 NEI, or 2018 and 2019 EQUATES data).

I'm working on the FBA methods and scripts for the EQUATES data right now, and I'll link to them once I'm finished, by way of an example of what I'm thinking.

URL concatenation missing '&' when forming Quickstats

When indirectly calling on the code to generate the FBA for the CoA Cropland, it creates this URL where the xxxx is a valid API key

2022-01-11 10:50:10 INFO     Calling https://quickstats.nass.usda.gov/api/api_GET/?key=xxxxEsource_desc=CENSUS&sector_desc=ECONOMICS&statisticcat_desc=AREA%26statisticcat_desc%3DAREA+OPERATED&commodity_desc=AG+LAND%26commodity_desc%3DFARM+OPERATIONS&unit_desc=ACRES%26unit_desc%3DOPERATIONS&agg_level_desc=NATIONAL&year=2017
ERROR Error in URL request!

I can see that there is no ampersand after the key value to separate it from the next URL parameter.
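
One way to avoid this class of bug is to let the standard library join the parameters rather than concatenating strings by hand. The sketch below is only illustrative: `build_quickstats_url` is a hypothetical helper, not a flowsa function, and the parameters are a subset of those in the logged URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://quickstats.nass.usda.gov/api/api_GET/"


def build_quickstats_url(api_key, params):
    """Build the query string with urlencode so every pair is joined by '&'."""
    query = urlencode({"key": api_key, **params})
    return f"{BASE_URL}?{query}"


url = build_quickstats_url(
    "xxxx",  # placeholder for a valid API key
    {"source_desc": "CENSUS", "agg_level_desc": "NATIONAL", "year": 2017},
)
print(url)
```

Because `urlencode` also percent-encodes values, each parameter should be passed as its own key rather than embedded in another parameter's value.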

Error when all flows are 0 for an activity

For a particular activity, all flows are reported as zero. When that flow_subset is passed to agg_by_geoscale (here), all flows that are 0 get dropped after aggregating. This generates an error later when sectors are added to that flow_subset.

Check that NAICS coming in from NAICS-like sources are present in our NAICS list for the given year

This is an issue discovered by @bl-young with a NAICS code coming in from RCRA via stewi that was not a valid 2012 NAICS code.
A related issue is USEPA/useeior#83.

In any NAICS-like source, we need to check that the NAICS codes are in our NAICS code list. If they are not present, we probably need to check whether they are present in an older NAICS schema and see if we can apply a mapping to get them into the current NAICS schema (2012 at the moment).
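
The check described above could be sketched roughly as follows. Everything here is an assumption for illustration: `validate_naics` is a hypothetical helper, the current code list is a two-item stand-in, and the old-to-new mapping is made up.

```python
def validate_naics(codes, current_codes, older_to_current):
    """Split incoming codes into valid, remapped, and unknown groups."""
    valid, remapped, unknown = [], {}, []
    for code in codes:
        if code in current_codes:
            valid.append(code)
        elif code in older_to_current:
            # Found in an older schema: record the current equivalent
            remapped[code] = older_to_current[code]
        else:
            unknown.append(code)
    return valid, remapped, unknown


current = {"111110", "111120"}     # stand-in for the current NAICS list
old_to_new = {"999990": "111110"}  # made-up mapping, for illustration only
valid, remapped, unknown = validate_naics(
    ["111110", "999990", "000000"], current, old_to_new
)
```

Codes that land in `unknown` are the ones that would need a warning or manual review.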

Air transportation emissions (NEI) misaligned

The NEI captures emissions from landings and takeoffs (LTO) which are assigned to airports in the NEI point dataset. In most cases, these would end up assigned to NAICS that would get assigned to 48A000 - Scenic and sightseeing transportation and support activities for transportation. Instead they should be assigned to Air Transportation (481000)

See NEI TSD: https://www.epa.gov/sites/production/files/2018-06/documents/nei2014v2_tsd_09may2018.pdf; section 3.2

Emissions from LTO are noted by specific SCCs, primarily 2275020000

pandas.np deprecation

Starting here and used in this function is pd.np.where

df.loc[:, 'FlowName'] = pd.np.where(df.Description.str.contains("fresh"), "fresh",

It's throwing a warning of future deprecation.
FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead

USDA_CoA_Cropland has mixed flow classes

The OPERATIONS flows are not a unit of Land, but rather of Other, so these should be stored in a separate parquet file. At this point flowsa doesn't support having more than one class stored together in a parquet. If that needs to change, then we need to change the source catalog to have a list for class instead of a string.

materialflowlist import error

Running "CRHW_national_2017" returns error:

import materialflowlist as mfl

ModuleNotFoundError: No module named 'materialflowlist'

@bl-young We need to add materialflowlist to package requirements/setup. Do you have a specific version? Or did you want to make this package an optional requirement?

Consider only allowing getFlowByActivity to take a single year and use an int

As of now, it is messy to have to send this function a list rather than a single year when we don't have any use cases for multiple years.
It would be just as easy for the user to call the function multiple times for multiple years (like in a loop) and concatenate.

If it takes a single year it will also be more consistent with getFlowBySector.

get fba subset issue

When using USDA crop data as the allocation source, we need to aggregate down from 7 to 6 digit NAICS for cases where the crosswalk is already based on a six-digit NAICS.

Similarly, we need to aggregate up from 5 to 6 digits.

That is, an activity that is split between two NAICS in the crosswalk, e.g. 111140 and 111920, will get missed in get_fba_allocation_subset because it only appears as 11192.
Likewise, a crosswalk for 111120 will get missed because it only appears as 111120A.

NAICS in stewicombo highly sensitive to FRS data

Assigning stewicombo data to sectors uses FRS NAICS assignments. In some cases, multiple NAICS are listed for an individual dataset. The first listed NAICS for a specific inventory in the FRS system is used.

In some cases the available NAICS are wildly different.

# use NAICS from facility matcher so drop them here
facility_mapping.drop(columns=['NAICS'], inplace=True)
# merge dataframes to assign facility information based on facility IDs
df = pd.merge(df, facility_mapping, how='left',
              on='FacilityID')
all_NAICS = obtain_NAICS_from_facility_matcher(inventory_list)
df = pd.merge(df, all_NAICS, how='left', on=['FRS_ID', 'Source'])

Transportation Satellite Account FBA not generating properly

I've verified that the TSA data frame is correctly parsed by the tsa_parse() function from BTS_TSA.py, but then the FBA is empty. So somewhere between the data frame being parsed and the final FBA being written, all the rows are dropped. I've spent some time on it, but I may need some help figuring out this issue.

flownames in EPA GHG fbas

FlowNames stored in GHG fbas need to be corrected
Flownames in the GHGI generally are CO2, CH4, N2O for main gases

reported from stored fba files

examples:
in T_2_1: Recent_Trends... this must be replaced with the gas name like above
in T_3_10: CH4 Emissions from Stationary Combustion should be CH4

These all need to be reviewed and fixed.
fyi @catherinebirney

XLRD issue.

File "C:\Users\MelissaC\Envs\flowsa\lib\site-packages\xlrd\__init__.py", line 170, in open_workbook
raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported

Occurred when running --year 2012 --source EIA_CBECS_Land.

Remove class param from flowbyactivity

The class parameter is in a list format that is awkward and perhaps not necessary to have in fba. We could easily remove the class requirement, and users could apply a simple subset command if they wanted to get only one class.

dynamically_import_fxn breaking for stewi

Accessing the FBS_outside_flowsa functions through the dynamic import is failing for stewi-related data. I think the yamls need to be revised.

ModuleNotFoundError: No module named 'flowsa.data_source_scripts.stewicombo'

Allow for more than 2 FBAs in allocation of an FBS activity set

The current method only allows for "allocation_source" and "helper_source" FBAs for FBS activity set allocation. Modify the FBS yaml to allow an unlimited number of FBAs to be called for allocation. This change will make the methodology more transparent and limit the number of FBAs hardcoded into cleaning functions.

Import error: No module named 'stewi.globals'; 'stewi' is not a package

After installing with pip, attempting to access from flowsa:
...
File "C:\Users\cbirney\git_projects\flowsa\flowsa\stewi.py", line 31, in stewicombo_to_sector
import stewicombo
File "C:\Users\cbirney\AppData\Local\Programs\Python\Python37\lib\site-packages\stewicombo\__init__.py", line 5, in <module>
from stewicombo.overlaphandler import aggregate_and_remove_overlap
File "C:\Users\cbirney\AppData\Local\Programs\Python\Python37\lib\site-packages\stewicombo\overlaphandler.py", line 3, in <module>
from stewi.globals import log
ModuleNotFoundError: No module named 'stewi.globals'; 'stewi' is not a package

It appears that duplicate globals.py could be causing an issue here?
cc: @catherinebirney

export log to txt

Given the amount of information in the logger, it would be nice to save this output after generating an FBS
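
A minimal sketch of one way to do this with the standard `logging` module. The logger name, file name, and the helper `add_file_log` are hypothetical; flowsa's own logger setup may differ.

```python
import logging
import os
import tempfile


def add_file_log(logger, path, level=logging.INFO):
    """Attach a FileHandler so records are also written to `path`."""
    handler = logging.FileHandler(path, mode="w", encoding="utf-8")
    handler.setLevel(level)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
    )
    logger.addHandler(handler)
    return handler


log = logging.getLogger("fbs_example")  # hypothetical logger name
log.setLevel(logging.INFO)
log_path = os.path.join(tempfile.gettempdir(), "fbs_run_log.txt")

handler = add_file_log(log, log_path)
log.info("Generating flow-by-sector dataset")

# Detach and close the handler once the run is finished
log.removeHandler(handler)
handler.close()
```

Attaching the handler only for the duration of an FBS run would keep one text file per generated dataset.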

FBS: some NAICS dropping

When expanding the NAICS list, more detailed sectors will get dropped if another entry in the mapping already maps to those sectors. Line 119-120

With NEI data, some activities are more specific than others. E.g. an SCC might apply to all Agriculture (NAICS: 111) while another SCC will apply to specific crops (NAICS: 11114). When both are present, the first SCC won't get assigned to 11114 because it drops out from the mapping.

My preference would be to exclude this step, but I don't know how that would impact other mappings. @catherinebirney

Possible infinite loop in common.py

In common.find_true_file_path(), if the directory, filename, and extension do not combine into a valid path, and if removing chunks of the filename following _ does not lead to a valid path (e.g. due to a typo, or an error in the directory or extension), the function can get stuck in an infinite loop.
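
A sketch of how the search could be bounded so that it always terminates. This is not the existing implementation: `find_existing_path` is a hypothetical stand-in that strips trailing `_` chunks and gives up once none remain.

```python
from pathlib import Path


def find_existing_path(directory, filename, extension):
    """Drop trailing '_chunk' pieces until a file exists, or give up.

    Returns the matching Path, or None when no candidate exists.
    """
    stem = filename
    while True:
        candidate = Path(directory) / f"{stem}{extension}"
        if candidate.is_file():
            return candidate
        if "_" not in stem:
            return None  # nothing left to strip, so stop instead of looping
        # Each pass removes one underscore, so the loop must terminate
        stem = stem.rsplit("_", 1)[0]
```

Returning `None` (or raising an error that names the directory and extension) makes a typo visible immediately instead of hanging.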

Permit user passed FBS YAML

Allow flow-by-sector YAMLs to be kept outside of the package and passed for processing in a getFlowBySector() call. This will allow development of FBS methods outside of the main flowsa repo

parquet version number based on user's egg-info

The version number attached to an FBA or FBS parquet is based on the version in a user's directory flowsa.egg-info/PKG-INFO. The PKG-INFO is created when a user runs pip install -e flowsa.

To update egg-info, a user can run: python setup.py bdist_egg

Consider making an updateable version parameter that is called on when naming parquets?

@WesIngwersen @bl-young

Joblib cache arbitrarily located, can fail

If the joblib cache is located in '.cache', then it attempts to create the '.cache' directory in the current working directory, wherever that may be. If the user does not have permissions there, then importing flowsa fails.
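
One possible mitigation is to resolve the cache location explicitly and fall back to the system temporary directory when the preferred location cannot be created or written. The sketch below is an assumption-laden illustration: `resolve_cache_dir` and the directory names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_dir(preferred=None):
    """Return a writable cache directory, falling back to the temp dir."""
    candidates = [
        Path(preferred) if preferred else Path.home() / ".cache" / "flowsa",
        Path(tempfile.gettempdir()) / "flowsa_cache",
    ]
    for path in candidates:
        try:
            path.mkdir(parents=True, exist_ok=True)
        except OSError:
            continue  # cannot create here, try the next location
        if os.access(path, os.W_OK):
            return path
    raise RuntimeError("No writable cache directory found")
```

The returned path could then be passed as the `location` argument of `joblib.Memory`, so the cache no longer depends on the current working directory.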

Inconsistent use of source names

Ideally the sources need to be consistent across these uses:
in fba data 'SourceName' column
in the file names of the parquet
in the Crosswalk file names
in the Source Catalog

The major issue is that we generally have a provider, an inventory/report, and a specific table or flow type.
