Interpolation can be slow if the number of cycles is high. Pandas interpolate only use

Comments (2)

ardunn commented on September 28, 2024

We should also profile the code for slow or memory-hogging spots, in case interpolation isn't the main culprit.
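As a starting point for the timing side, here is a minimal sketch using the standard-library `cProfile` and `pstats` modules. The helper name `profile_call` and the stand-in workload `slow_sum` are made up for illustration; in practice the workload would be whichever structuring call is under suspicion.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    # Stand-in workload (hypothetical); replace with the call being investigated.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 100_000)
print(report)
```

Sorting by cumulative time tends to surface the outer function that dominates the run, which would show whether interpolation or something else is the slow part.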


ardunn commented on September 28, 2024

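I did some memory profiling of the structuring methods. Take the numbers with a grain of salt: according to a Stack Overflow answer, memory_profiler can be inaccurate inside loops because the OS allocates and frees memory in chunks. That may explain the large negative increments on the loop lines in the output below.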
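One way to cross-check the per-line numbers is the standard-library `tracemalloc` module, which traces allocations made through Python's allocator rather than sampling process memory. The sketch below is only an illustration: `measure_peak` and `build_buffers` are hypothetical names, and the workload is a stand-in, not beep code.

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Return (result, peak_bytes) for traced allocations made while func runs."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_buffers(n_buffers, size):
    # Stand-in workload (hypothetical): allocates n_buffers * size bytes in a loop.
    return [bytearray(size) for _ in range(n_buffers)]


buffers, peak = measure_peak(build_buffers, 20, 1_000_000)
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

Because this measures a different quantity than memory_profiler, the two will not agree exactly, but a function that looks expensive in both is a more credible suspect.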

Problem areas

interpolate_step
interpolate_cycles
interpolate_diagnostic_cycles

Full output of memory profiling while structuring:

Raw file size on disk is 140MB. Size of raw loaded dataframe is 179MB.
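For reference, an in-memory DataFrame size can be measured with `DataFrame.memory_usage(deep=True)`. The sketch below is an assumption about how such a number could be obtained, not necessarily how the `SIZEOF` lines in this log were produced; `dataframe_megabytes` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd


def dataframe_megabytes(df):
    """In-memory size of a DataFrame in MB, including object-column contents."""
    return df.memory_usage(index=True, deep=True).sum() / 1e6


# Small synthetic stand-in for raw cycler data (column names follow the profiled code).
n_rows = 100_000
raw = pd.DataFrame(
    {
        "cycle_index": np.repeat(np.arange(100), n_rows // 100),
        "voltage": np.linspace(2.5, 4.2, n_rows),
        "current": np.ones(n_rows),
    }
)
print(f"default dtypes: {dataframe_megabytes(raw):.3f} MB")

# Downcasting is one common way to shrink a frame; whether it is safe depends on the data.
small = raw.astype({"cycle_index": "int32", "voltage": "float32", "current": "float32"})
print(f"downcast dtypes: {dataframe_megabytes(small):.3f} MB")
```

Note that `astype` returns a new frame, so the result has to be assigned for the smaller dtypes to take effect.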

SIZEOF: raw_data (): 179.946804 MB
MEM: GETTING STRUCTURING PARAMETERS
Filename: /Users/ardunn/alex/tri/code/beep/beep/structure/base.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   123    434.6 MiB    434.6 MiB           1               def wrapper(*args, **kwargs):
   124    434.6 MiB      0.0 MiB           1                   if args[0]._is_legacy:
   125                                                             raise ValueError(
   126                                                                 f"{args[0].__class__.__name__} is deserialized from a legacy file! Operation not allowed."
   127                                                             )
   128                                                         else:
   129    299.1 MiB   -135.5 MiB           1                       return func(*args, **kwargs)


MEM: Structuring with parameters
Filename: /Users/ardunn/alex/tri/code/beep/beep/structure/base.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   974    299.1 MiB    299.1 MiB           1       @profile
   975                                             def summarize_diagnostic(self, diagnostic_available):
   976                                                 """
   977                                                 Gets summary statistics for data according to location of
   978                                                 diagnostic cycles in the data
   979                                         
   980                                                 Args:
   981                                                     diagnostic_available (dict): dictionary with diagnostic_types
   982                                                         as list, 'length' of the diagnostic in cycles and location
   983                                                         of the diagnostic by cycle index
   984                                         
   985                                                 Returns:
   986                                                     (DataFrame) of summary statistics by cycle
   987                                         
   988                                                 """
   989                                         
   990    299.1 MiB      0.0 MiB           1           max_cycle = self.raw_data.cycle_index.max()
   991                                                 starts_at = [
   992    299.1 MiB      0.0 MiB           7               i for i in diagnostic_available["diagnostic_starts_at"] if i <= max_cycle
   993                                                 ]
   994    299.1 MiB      0.0 MiB           1           diag_cycles_at = list(
   995    299.1 MiB      0.0 MiB           1               itertools.chain.from_iterable(
   996    299.1 MiB      0.0 MiB           7                   [list(range(i, i + diagnostic_available["length"])) for i in starts_at]
   997                                                     )
   998                                                 )
   999    305.4 MiB      6.3 MiB           1           diag_summary = self.raw_data.groupby("cycle_index").agg(self._diag_aggregation)
  1000                                         
  1001    305.4 MiB      0.0 MiB           1           diag_summary.columns = self._diag_summary_cols
  1002                                         
  1003    305.4 MiB      0.0 MiB           1           diag_summary = diag_summary[diag_summary.index.isin(diag_cycles_at)]
  1004                                         
  1005                                                 diag_summary["coulombic_efficiency"] = (
  1006    305.4 MiB      0.0 MiB           1               diag_summary["discharge_capacity"] / diag_summary["charge_capacity"]
  1007                                                 )
  1008    305.4 MiB      0.0 MiB           1           diag_summary["paused"] = self.raw_data.groupby("cycle_index").apply(
  1009    427.6 MiB    122.2 MiB           1               get_max_paused_over_threshold
  1010                                                 )
  1011                                         
  1012    427.6 MiB      0.0 MiB           1           diag_summary.reset_index(drop=True, inplace=True)
  1013                                         
  1014    427.6 MiB      0.0 MiB           1           diag_summary["cycle_type"] = pd.Series(
  1015    427.6 MiB      0.0 MiB           1               diagnostic_available["cycle_type"] * len(starts_at)
  1016                                                 )
  1017                                         
  1018    427.6 MiB      0.0 MiB           1           diag_summary = self._cast_dtypes(diag_summary, "diagnostic_summary")
  1019                                         
  1020    427.6 MiB      0.0 MiB           1           return diag_summary


SIZEOF: diagnostic_summary (): 0.003069 MB
██████████| 168/168 [00:45<00:00,  3.73it/s]
Filename: /Users/ardunn/alex/tri/code/beep/beep/structure/base.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   842    427.6 MiB    427.6 MiB           1       @profile
   843                                             # equivalent of get_interpolated_diagnostic_cycles
   844                                             def interpolate_diagnostic_cycles(
   845                                                     self, diagnostic_available, resolution=1000, v_resolution=0.0005
   846                                             ):
   847                                                 """
   848                                                 Interpolates data according to location and type of diagnostic
   849                                                 cycles in the data
   850                                         
   851                                                 Args:
   852                                                     diagnostic_available (dict): dictionary with diagnostic_types
   853                                                         as list, 'length' of the diagnostic in cycles and location
   854                                                         of the diagnostic
   855                                                     resolution (int): resolution of interpolation
   856                                                     v_resolution (float): voltage delta to set for range based interpolation
   857                                         
   858                                                 Returns:
   859                                                     (pd.DataFrame) of interpolated diagnostic steps by step and cycle
   860                                         
   861                                                 """
   862                                                 # Get the project name and the parameter file for the diagnostic
   863    427.6 MiB      0.0 MiB           1           project_name_list = parameters_lookup.get_project_sequence(self.paths["raw"])
   864    427.6 MiB      0.0 MiB           1           diag_path = os.path.join(MODULE_DIR, "procedure_templates")
   865    427.6 MiB      0.0 MiB           1           v_range = parameters_lookup.get_diagnostic_parameters(
   866    427.6 MiB      0.0 MiB           1               diagnostic_available, diag_path, project_name_list[0]
   867                                                 )
   868                                         
   869                                                 # Determine the cycles and types of the diagnostic cycles
   870    427.6 MiB      0.0 MiB           1           max_cycle = self.raw_data.cycle_index.max()
   871                                                 starts_at = [
   872    427.6 MiB      0.0 MiB           7               i for i in diagnostic_available["diagnostic_starts_at"] if i <= max_cycle
   873                                                 ]
   874    427.6 MiB      0.0 MiB           1           diag_cycles_at = list(
   875    427.6 MiB      0.0 MiB           1               itertools.chain.from_iterable(
   876    427.6 MiB      0.0 MiB           7                   [range(i, i + diagnostic_available["length"]) for i in starts_at]
   877                                                     )
   878                                                 )
   879                                                 # Duplicate cycle type list end to end for each starting index
   880    427.6 MiB      0.0 MiB           1           diag_cycle_type = diagnostic_available["cycle_type"] * len(starts_at)
   881    427.6 MiB      0.0 MiB           1           if not len(diag_cycles_at) == len(diag_cycle_type):
   882                                                     errmsg = (
   883                                                         "Diagnostic cycles, {}, and diagnostic cycle types, "
   884                                                         "{}, are unequal lengths".format(diag_cycles_at, diag_cycle_type)
   885                                                     )
   886                                                     raise ValueError(errmsg)
   887                                         
   888    435.4 MiB      7.8 MiB           1           diag_data = self.raw_data[self.raw_data["cycle_index"].isin(diag_cycles_at)]
   889                                         
   890                                                 # Counter to ensure non-contiguous repeats of step_index
   891                                                 # within same cycle_index are grouped separately
   892    439.2 MiB      3.7 MiB           1           diag_data.loc[:, "step_index_counter"] = 0
   893                                         
   894    469.9 MiB  -1064.2 MiB          21           for cycle_index in diag_cycles_at:
   895    458.1 MiB   -990.3 MiB          20               indices = diag_data.loc[diag_data.cycle_index == cycle_index].index
   896    469.9 MiB   -821.1 MiB          20               step_index_list = diag_data.step_index.loc[indices]
   897    469.9 MiB  -1057.6 MiB          20               diag_data.loc[indices, "step_index_counter"] = step_index_list.ne(
   898    469.9 MiB  -1064.0 MiB          20                   step_index_list.shift()
   899                                                     ).cumsum()
   900                                         
   901    406.6 MiB    -63.3 MiB           1           group = diag_data.groupby(["cycle_index", "step_index", "step_index_counter"])
   902                                                 incl_columns = [
   903    406.6 MiB      0.0 MiB           1               "current",
   904    406.6 MiB      0.0 MiB           1               "charge_capacity",
   905    406.6 MiB      0.0 MiB           1               "discharge_capacity",
   906    406.6 MiB      0.0 MiB           1               "charge_energy",
   907    406.6 MiB      0.0 MiB           1               "discharge_energy",
   908    406.6 MiB      0.0 MiB           1               "internal_resistance",
   909    406.6 MiB      0.0 MiB           1               "temperature",
   910    406.6 MiB      0.0 MiB           1               "test_time",
   911                                                 ]
   912                                         
   913    406.6 MiB      0.0 MiB           1           diag_dict = {}
   914    424.0 MiB    -67.3 MiB          18           for cycle in diag_data.cycle_index.unique():
   915    424.0 MiB    -52.6 MiB          17               diag_dict.update({cycle: None})
   916    424.0 MiB    -47.7 MiB          17               steps = diag_data[diag_data.cycle_index == cycle].step_index.unique()
   917    424.0 MiB    -66.2 MiB          17               diag_dict[cycle] = list(steps)
   918                                         
   919    410.4 MiB    -13.6 MiB           1           all_dfs = []
   920    551.9 MiB     75.3 MiB         169           for (cycle_index, step_index, step_index_counter), df in tqdm(group):
   921    551.9 MiB    -59.7 MiB         168               if len(df) < 2:
   922                                                         continue
   923    551.9 MiB    -59.7 MiB         168               if diag_cycle_type[diag_cycles_at.index(cycle_index)] == "hppc":
   924    551.9 MiB    -52.3 MiB         142                   v_hppc_step = [df.voltage.min(), df.voltage.max()]
   925    551.9 MiB    -50.6 MiB         142                   hppc_resolution = int(
   926    551.9 MiB    -50.6 MiB         142                       (df.voltage.max() - df.voltage.min()) / v_resolution
   927                                                         )
   928    551.9 MiB    -50.6 MiB         142                   new_df = interpolate_df(
   929    551.9 MiB    -50.6 MiB         142                       df,
   930    551.9 MiB    -50.6 MiB         142                       field_name="voltage",
   931    551.9 MiB    -50.6 MiB         142                       field_range=v_hppc_step,
   932    551.9 MiB    -50.6 MiB         142                       columns=incl_columns,
   933    551.9 MiB    -47.5 MiB         142                       resolution=hppc_resolution,
   934                                                         )
   935                                                     else:
   936    551.7 MiB     -7.4 MiB          26                   new_df = interpolate_df(
   937    551.7 MiB     -0.4 MiB          26                       df,
   938    551.7 MiB     -0.4 MiB          26                       field_name="voltage",
   939    551.7 MiB     -0.4 MiB          26                       field_range=v_range,
   940    551.7 MiB     -0.4 MiB          26                       columns=incl_columns,
   941    551.8 MiB      2.2 MiB          26                       resolution=resolution,
   942                                                         )
   943                                         
   944    551.9 MiB    -49.6 MiB         168               new_df["cycle_index"] = cycle_index
   945    551.9 MiB    -59.4 MiB         168               new_df["cycle_type"] = diag_cycle_type[diag_cycles_at.index(cycle_index)]
   946    551.9 MiB    -59.4 MiB         168               new_df["step_index"] = step_index
   947    551.9 MiB    -59.4 MiB         168               new_df["step_index_counter"] = step_index_counter
   948    551.9 MiB    -59.4 MiB         168               new_df["step_type"] = diag_dict[cycle_index].index(step_index)
   949    551.9 MiB    -59.4 MiB         168               new_df.astype(
   950                                                         {
   951    551.9 MiB    -59.4 MiB         168                       "cycle_index": "int32",
   952    551.9 MiB    -59.4 MiB         168                       "cycle_type": "category",
   953    551.9 MiB    -59.4 MiB         168                       "step_index": "uint8",
   954    551.9 MiB    -59.4 MiB         168                       "step_index_counter": "int16",
   955    551.9 MiB    -58.8 MiB         168                       "step_type": "uint8",
   956                                                         }
   957                                                     )
   958                                                     new_df["discharge_dQdV"] = (
   959    551.9 MiB    -59.5 MiB         168                   new_df.discharge_capacity.diff() / new_df.voltage.diff()
   960                                                     )
   961                                                     new_df["charge_dQdV"] = (
   962    551.9 MiB    -59.5 MiB         168                   new_df.charge_capacity.diff() / new_df.voltage.diff()
   963                                                     )
   964    551.9 MiB    -59.5 MiB         168               all_dfs.append(new_df)
   965                                         
   966                                                 # Ignore the index to avoid issues with overlapping voltages
   967    557.7 MiB      5.8 MiB           1           result = pd.concat(all_dfs, ignore_index=True)
   968                                                 # Cycle_index gets a little weird about typing, so round it here
   969    558.1 MiB      0.3 MiB           1           result.cycle_index = result.cycle_index.round()
   970    555.1 MiB     -3.0 MiB           1           result = self._cast_dtypes(result, "diagnostic_interpolated")
   971                                         
   972    555.1 MiB      0.0 MiB           1           return result
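The `step_index_counter` loop in the profile above (source lines 894 to 899) builds a boolean mask over the diagnostic data once per cycle. A possible alternative, sketched here on toy data, computes the same counter in a single grouped pass with `groupby(...).transform`. The function names are hypothetical, and no memory or speed measurement has been made for this variant.

```python
import pandas as pd


def step_counter_loop(df):
    """Per-cycle loop, mirroring the structure of the profiled code."""
    out = df.copy()
    out["step_index_counter"] = 0
    for cycle_index in out["cycle_index"].unique():
        indices = out.loc[out["cycle_index"] == cycle_index].index
        steps = out["step_index"].loc[indices]
        out.loc[indices, "step_index_counter"] = steps.ne(steps.shift()).cumsum()
    return out


def step_counter_grouped(df):
    """Same counter computed in one grouped pass, without a mask per cycle."""
    out = df.copy()
    out["step_index_counter"] = out.groupby("cycle_index")["step_index"].transform(
        lambda s: s.ne(s.shift()).cumsum()
    )
    return out


# Toy data: step 5 repeats non-contiguously within cycle 1.
toy = pd.DataFrame(
    {
        "cycle_index": [1, 1, 1, 1, 2, 2, 2],
        "step_index": [5, 5, 6, 5, 5, 6, 6],
    }
)
looped = step_counter_loop(toy)
grouped = step_counter_grouped(toy)
print(grouped)
```

Both versions increment the counter whenever `step_index` changes from the previous row within a cycle, so non-contiguous repeats of a step get distinct counter values.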


SIZEOF: diagnostic_summary (): 3.289344 MB
SIZEOF: cycle_indices in interpolate_step (): 0.002184 MB
Interpolating discharge (2.5 - 4.2)V (1000 points): 100%|██████████| 231/231 [01:15<00:00,  3.08it/s]
Filename: /Users/ardunn/alex/tri/code/beep/beep/structure/base.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   512    463.8 MiB    463.8 MiB           1       @profile
   513                                             def interpolate_step(
   514                                                     self,
   515                                                     v_range,
   516                                                     resolution,
   517                                                     step_type="discharge",
   518                                                     reg_cycles=None,
   519                                                     axis="voltage",
   520                                                     desc=None
   521                                             ):
   522                                                 """
   523                                                 Gets interpolated cycles for the step specified, charge or discharge.
   524                                         
   525                                                 Args:
   526                                                     v_range ([Float, Float]): list of two floats that define
   527                                                         the voltage interpolation range endpoints.
   528                                                     resolution (int): resolution of interpolated data.
   529                                                     step_type (str): which step to interpolate i.e. 'charge' or 'discharge'
   530                                                     reg_cycles (list): list containing cycle indicies of regular cycles
   531                                                     axis (str): which column to use for interpolation
   532                                                     desc (str): Description to print to tqdm column.
   533                                         
   534                                                 Returns:
   535                                                     pandas.DataFrame: DataFrame corresponding to interpolated values.
   536                                                 """
   537                                         
   538    463.8 MiB      0.0 MiB           1           if not desc:
   539                                                     desc = \
   540    463.8 MiB      0.0 MiB           1                   f"Interpolating {step_type} ({v_range[0]} - {v_range[1]})V " \
   541                                                         f"({resolution} points)"
   542                                         
   543    463.8 MiB      0.0 MiB           1           if step_type == "discharge":
   544    463.8 MiB      0.0 MiB           1               step_filter = step_is_dchg
   545                                                 elif step_type == "charge":
   546                                                     step_filter = step_is_chg
   547                                                 else:
   548                                                     raise ValueError("{} is not a recognized step type")
   549                                                 incl_columns = [
   550    463.8 MiB      0.0 MiB           1               "test_time",
   551    463.8 MiB      0.0 MiB           1               "voltage",
   552    463.8 MiB      0.0 MiB           1               "current",
   553    463.8 MiB      0.0 MiB           1               "charge_capacity",
   554    463.8 MiB      0.0 MiB           1               "discharge_capacity",
   555    463.8 MiB      0.0 MiB           1               "charge_energy",
   556    463.8 MiB      0.0 MiB           1               "discharge_energy",
   557    463.8 MiB      0.0 MiB           1               "internal_resistance",
   558    463.8 MiB      0.0 MiB           1               "temperature",
   559                                                 ]
   560    463.8 MiB      0.0 MiB           1           all_dfs = []
   561    465.0 MiB      1.2 MiB           1           cycle_indices = self.raw_data.cycle_index.unique()
   562    465.0 MiB      0.0 MiB         251           cycle_indices = sorted([c for c in cycle_indices if c in reg_cycles])
   563                                         
   564                                         
   565    465.0 MiB      0.0 MiB           1           pm(cycle_indices, "cycle_indices in interpolate_step")
   566                                         
   567    465.0 MiB      0.0 MiB         232           for cycle_index in tqdm(cycle_indices, desc=desc):
   568                                                     # Use a cycle_index mask instead of a global groupby to save memory
   569                                                     new_df = (
   570    465.3 MiB  -2537.3 MiB         231                   self.raw_data.loc[self.raw_data["cycle_index"] == cycle_index]
   571    465.3 MiB  -2608.6 MiB         231                       .groupby("step_index")
   572    459.9 MiB  -2619.5 MiB         231                       .filter(step_filter)
   573                                                     )
   574    459.9 MiB   -372.6 MiB         231               if new_df.size == 0:
   575                                                         continue
   576                                         
   577    459.9 MiB   -372.6 MiB         231               if axis in ["charge_capacity", "discharge_capacity"]:
   578                                                         axis_range = [self.raw_data[axis].min(),
   579                                                                       self.raw_data[axis].max()]
   580                                                         new_df = interpolate_df(
   581                                                             new_df,
   582                                                             axis,
   583                                                             field_range=axis_range,
   584                                                             columns=incl_columns,
   585                                                             resolution=resolution,
   586                                                         )
   587    459.9 MiB   -372.6 MiB         231               elif axis == "test_time":
   588                                                         axis_range = [new_df[axis].min(), new_df[axis].max()]
   589                                                         new_df = interpolate_df(
   590                                                             new_df,
   591                                                             axis,
   592                                                             field_range=axis_range,
   593                                                             columns=incl_columns,
   594                                                             resolution=resolution,
   595                                                         )
   596    459.9 MiB   -372.6 MiB         231               elif axis == "voltage":
   597    459.9 MiB   -372.6 MiB         231                   new_df = interpolate_df(
   598    459.9 MiB   -372.6 MiB         231                       new_df,
   599    459.9 MiB   -372.6 MiB         231                       axis,
   600    459.9 MiB   -372.6 MiB         231                       field_range=v_range,
   601    459.9 MiB   -372.6 MiB         231                       columns=incl_columns,
   602    460.0 MiB   -366.9 MiB         231                       resolution=resolution,
   603                                                         )
   604                                                     else:
   605                                                         raise ValueError(f"Axis {axis} not a valid step interpolation axis.")
   606    460.0 MiB      0.0 MiB         231               new_df["cycle_index"] = cycle_index
   607    460.0 MiB      0.0 MiB         231               new_df["step_type"] = step_type
   608    460.0 MiB      0.0 MiB         231               new_df["step_type"] = new_df["step_type"].astype("category")
   609    460.0 MiB      0.0 MiB         231               all_dfs.append(new_df)
   610                                         
   611                                                 # Ignore the index to avoid issues with overlapping voltages
   612    474.5 MiB      9.5 MiB           1           result = pd.concat(all_dfs, ignore_index=True)
   613                                         
   614                                                 # Cycle_index gets a little weird about typing, so round it here
   615    475.3 MiB      0.8 MiB           1           result.cycle_index = result.cycle_index.round()
   616                                         
   617    475.3 MiB      0.0 MiB           1           return result
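To make the per-step interpolation concrete, here is a minimal sketch of linear interpolation onto a uniform grid using `numpy.interp`. It is an assumption-laden illustration, not beep's `interpolate_df`: the helper `interpolate_on_grid`, its argument names, and the choice to return NaN outside the measured range are all hypothetical.

```python
import numpy as np
import pandas as pd


def interpolate_on_grid(df, axis, axis_range, columns, resolution):
    """Linearly interpolate `columns` onto a uniform grid of `axis` values.

    Hypothetical helper for illustration; not beep's interpolate_df.
    """
    grid = np.linspace(axis_range[0], axis_range[1], resolution)
    # np.interp needs increasing x values, so sort and drop repeated axis values.
    ordered = df.sort_values(axis).drop_duplicates(subset=axis)
    x = ordered[axis].to_numpy(dtype=float)
    out = {axis: grid}
    for col in columns:
        if col == axis:
            continue
        # Points outside the measured range become NaN rather than being extrapolated.
        out[col] = np.interp(
            grid, x, ordered[col].to_numpy(dtype=float), left=np.nan, right=np.nan
        )
    return pd.DataFrame(out)


# Toy discharge step: voltage falls from 4.0 V to 3.0 V while capacity rises.
step = pd.DataFrame(
    {
        "voltage": [4.0, 3.5, 3.0],
        "discharge_capacity": [0.0, 0.5, 1.0],
    }
)
result = interpolate_on_grid(
    step, "voltage", [3.0, 4.0], ["voltage", "discharge_capacity"], resolution=5
)
print(result)
```

Working on plain NumPy arrays per step keeps each intermediate small, which may matter when the loop runs once per cycle over hundreds of cycles.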


SIZEOF: cycle_indices in interpolate_step (): 0.002184 MB
Interpolating charge (2.5 - 4.2)V (1000 points): 100%|██████████| 231/231 [01:00<00:00,  3.81it/s]
Filename: /Users/ardunn/alex/tri/code/beep/beep/structure/base.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   512    463.7 MiB    463.7 MiB           1       @profile
   513                                             def interpolate_step(
   514                                                     self,
   515                                                     v_range,
   516                                                     resolution,
   517                                                     step_type="discharge",
   518                                                     reg_cycles=None,
   519                                                     axis="voltage",
   520                                                     desc=None
   521                                             ):
   522                                                 """
   523                                                 Gets interpolated cycles for the step specified, charge or discharge.
   524                                         
   525                                                 Args:
   526                                                     v_range ([Float, Float]): list of two floats that define
   527                                                         the voltage interpolation range endpoints.
   528                                                     resolution (int): resolution of interpolated data.
   529                                                     step_type (str): which step to interpolate i.e. 'charge' or 'discharge'
   530                                                     reg_cycles (list): list containing cycle indicies of regular cycles
   531                                                     axis (str): which column to use for interpolation
   532                                                     desc (str): Description to print to tqdm column.
   533                                         
   534                                                 Returns:
   535                                                     pandas.DataFrame: DataFrame corresponding to interpolated values.
   536                                                 """
   537                                         
   538    463.7 MiB      0.0 MiB           1           if not desc:
   539                                                     desc = \
   540    463.7 MiB      0.0 MiB           1                   f"Interpolating {step_type} ({v_range[0]} - {v_range[1]})V " \
   541                                                         f"({resolution} points)"
   542                                         
   543    463.7 MiB      0.0 MiB           1           if step_type == "discharge":
   544                                                     step_filter = step_is_dchg
   545    463.7 MiB      0.0 MiB           1           elif step_type == "charge":
   546    463.7 MiB      0.0 MiB           1               step_filter = step_is_chg
   547                                                 else:
   548                                                     raise ValueError("{} is not a recognized step type")
   549                                                 incl_columns = [
   550    463.7 MiB      0.0 MiB           1               "test_time",
   551    463.7 MiB      0.0 MiB           1               "voltage",
   552    463.7 MiB      0.0 MiB           1               "current",
   553    463.7 MiB      0.0 MiB           1               "charge_capacity",
   554    463.7 MiB      0.0 MiB           1               "discharge_capacity",
   555    463.7 MiB      0.0 MiB           1               "charge_energy",
   556    463.7 MiB      0.0 MiB           1               "discharge_energy",
   557    463.7 MiB      0.0 MiB           1               "internal_resistance",
   558    463.7 MiB      0.0 MiB           1               "temperature",
   559                                                 ]
   560    463.7 MiB      0.0 MiB           1           all_dfs = []
   561    468.8 MiB      5.1 MiB           1           cycle_indices = self.raw_data.cycle_index.unique()
   562    468.8 MiB      0.0 MiB         251           cycle_indices = sorted([c for c in cycle_indices if c in reg_cycles])
   563                                         
   564                                         
   565    468.8 MiB      0.0 MiB           1           pm(cycle_indices, "cycle_indices in interpolate_step")
   566                                         
   567    475.2 MiB      0.0 MiB         232           for cycle_index in tqdm(cycle_indices, desc=desc):
   568                                                     # Use a cycle_index mask instead of a global groupby to save memory
   569                                                     new_df = (
   570    475.2 MiB   -153.4 MiB         231                   self.raw_data.loc[self.raw_data["cycle_index"] == cycle_index]
   571    475.2 MiB   -218.9 MiB         231                       .groupby("step_index")
   572    475.2 MiB   -218.8 MiB         231                       .filter(step_filter)
   573                                                     )
   574    475.2 MiB   -229.4 MiB         231               if new_df.size == 0:
   575                                                         continue
   576                                         
   577    475.2 MiB   -229.4 MiB         231               if axis in ["charge_capacity", "discharge_capacity"]:
   578    475.2 MiB   -229.4 MiB         231                   axis_range = [self.raw_data[axis].min(),
   579    475.2 MiB   -229.4 MiB         231                                 self.raw_data[axis].max()]
   580    475.2 MiB   -229.4 MiB         231                   new_df = interpolate_df(
   581    475.2 MiB   -229.4 MiB         231                       new_df,
   582    475.2 MiB   -229.4 MiB         231                       axis,
   583    475.2 MiB   -229.4 MiB         231                       field_range=axis_range,
   584    475.2 MiB   -229.4 MiB         231                       columns=incl_columns,
   585    475.2 MiB   -223.8 MiB         231                       resolution=resolution,
   586                                                         )
   587                                                     elif axis == "test_time":
   588                                                         axis_range = [new_df[axis].min(), new_df[axis].max()]
   589                                                         new_df = interpolate_df(
   590                                                             new_df,
   591                                                             axis,
   592                                                             field_range=axis_range,
   593                                                             columns=incl_columns,
   594                                                             resolution=resolution,
   595                                                         )
   596                                                     elif axis == "voltage":
   597                                                         new_df = interpolate_df(
   598                                                             new_df,
   599                                                             axis,
   600                                                             field_range=v_range,
   601                                                             columns=incl_columns,
   602                                                             resolution=resolution,
   603                                                         )
   604                                                     else:
   605                                                         raise ValueError(f"Axis {axis} not a valid step interpolation axis.")
   606    475.2 MiB      0.0 MiB         231               new_df["cycle_index"] = cycle_index
   607    475.2 MiB      0.0 MiB         231               new_df["step_type"] = step_type
   608    475.2 MiB      0.0 MiB         231               new_df["step_type"] = new_df["step_type"].astype("category")
   609    475.2 MiB      0.0 MiB         231               all_dfs.append(new_df)
   610                                         
   611                                                 # Ignore the index to avoid issues with overlapping voltages
   612    494.4 MiB     19.2 MiB           1           result = pd.concat(all_dfs, ignore_index=True)
   613                                         
   614                                                 # Cycle_index gets a little weird about typing, so round it here
   615    495.2 MiB      0.9 MiB           1           result.cycle_index = result.cycle_index.round()
   616                                         
   617    495.2 MiB      0.0 MiB           1           return result
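
A note on the listing above: lines 578-579 take the min and max of the whole raw frame on each of the 231 loop passes. The loop does not appear to modify raw_data, so that range could probably be computed once before the loop. Below is a minimal sketch on a toy frame. `per_cycle_frames` is a hypothetical helper for illustration, not part of beep, and the interpolation call itself is left out.

```python
import pandas as pd


def per_cycle_frames(raw, cycle_indices, axis):
    """Yield (cycle_index, cycle rows, axis_range) for each non-empty cycle.

    Hypothetical helper: the global axis range is computed once, outside
    the loop, instead of once per cycle.
    """
    # Loop-invariant: depends only on the full frame, not on the cycle.
    axis_range = [raw[axis].min(), raw[axis].max()]
    for cycle_index in cycle_indices:
        # Same per-cycle boolean mask as in the profiled code.
        cycle_df = raw.loc[raw["cycle_index"] == cycle_index]
        if cycle_df.size == 0:
            continue
        yield cycle_index, cycle_df, axis_range


if __name__ == "__main__":
    toy = pd.DataFrame(
        {
            "cycle_index": [1, 1, 2, 2],
            "charge_capacity": [0.0, 0.5, 0.1, 1.1],
        }
    )
    for idx, frame, rng in per_cycle_frames(toy, [1, 2, 3], "charge_capacity"):
        print(idx, len(frame), rng)
```

This only removes repeated work. Whether it changes the memory picture is not something the profile above settles.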


Filename: /Users/ardunn/alex/tri/code/beep/beep/structure/base.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   619    546.6 MiB    546.6 MiB           1       @profile
   620                                             def interpolate_cycles(
   621                                                     self,
   622                                                     v_range=None,
   623                                                     resolution=1000,
   624                                                     diagnostic_available=None,
   625                                                     charge_axis='charge_capacity',
   626                                                     discharge_axis='voltage'
   627                                             ):
   628                                                 """
   629                                                 Gets interpolated cycles for both charge and discharge steps.
   630                                         
   631                                                 Args:
   632                                                     v_range ([float, float]): list of two floats that define
   633                                                         the voltage interpolation range endpoints.
   634                                                     resolution (int): resolution of interpolated data.
   635                                                     diagnostic_available (dict): dictionary containing information about
   636                                                         location of diagnostic cycles
   637                                                     charge_axis (str): column to use for interpolation for charge
   638                                                     discharge_axis (str): column to use for interpolation for discharge
   639                                         
   640                                                 Returns:
   641                                                     (pandas.DataFrame): DataFrame corresponding to interpolated values.
   642                                                 """
   643    546.6 MiB      0.0 MiB           1           if diagnostic_available:
   644    546.6 MiB      0.0 MiB           1               diag_cycles = list(
   645    546.6 MiB      0.0 MiB           1                   itertools.chain.from_iterable(
   646                                                             [
   647    546.6 MiB      0.0 MiB           7                           list(range(i, i + diagnostic_available["length"]))
   648    546.6 MiB      0.0 MiB           5                           for i in diagnostic_available["diagnostic_starts_at"]
   649    546.6 MiB      0.0 MiB           4                           if i <= self.raw_data.cycle_index.max()
   650                                                             ]
   651                                                         )
   652                                                     )
   653                                                     reg_cycles = [
   654    550.7 MiB      4.1 MiB         251                   i for i in self.raw_data.cycle_index.unique() if
   655    550.7 MiB      0.0 MiB         248                   i not in diag_cycles
   656                                                     ]
   657                                                 else:
   658                                                     reg_cycles = [i for i in self.raw_data.cycle_index.unique()]
   659                                         
   660    550.7 MiB      0.0 MiB           1           v_range = v_range or [2.8, 3.5]
   661                                         
   662                                                 # If any regular cycle contains a waveform step, interpolate on test_time.
   663                                         
   664    555.8 MiB      5.1 MiB           1           if self.raw_data[self.raw_data.cycle_index.isin(reg_cycles)]. \
   665    555.8 MiB      0.0 MiB           1                   groupby(["cycle_index", "step_index"]). \
   666    468.2 MiB    -87.6 MiB           1                   apply(step_is_waveform_dchg).any():
   667                                                     discharge_axis = 'test_time'
   668                                         
   669    469.3 MiB      1.1 MiB           1           if self.raw_data[self.raw_data.cycle_index.isin(reg_cycles)]. \
   670    469.3 MiB      0.0 MiB           1                   groupby(["cycle_index", "step_index"]). \
   671    463.8 MiB     -5.5 MiB           1                   apply(step_is_waveform_chg).any():
   672                                                     charge_axis = 'test_time'
   673                                         
   674    463.8 MiB      0.0 MiB           1           interpolated_discharge = self.interpolate_step(
   675    463.8 MiB      0.0 MiB           1               v_range,
   676    463.8 MiB      0.0 MiB           1               resolution,
   677    463.8 MiB      0.0 MiB           1               step_type="discharge",
   678    463.8 MiB      0.0 MiB           1               reg_cycles=reg_cycles,
   679    463.7 MiB     -0.0 MiB           1               axis=discharge_axis,
   680                                                 )
   681    463.7 MiB      0.0 MiB           1           interpolated_charge = self.interpolate_step(
   682    463.7 MiB      0.0 MiB           1               v_range,
   683    463.7 MiB      0.0 MiB           1               resolution,
   684    463.7 MiB      0.0 MiB           1               step_type="charge",
   685    463.7 MiB      0.0 MiB           1               reg_cycles=reg_cycles,
   686    483.4 MiB     19.7 MiB           1               axis=charge_axis,
   687                                                 )
   688    483.4 MiB      0.0 MiB           1           result = pd.concat(
   689    534.4 MiB     50.9 MiB           1               [interpolated_discharge, interpolated_charge], ignore_index=True
   690                                                 )
   691                                         
   692    589.9 MiB     55.6 MiB           1           return self._cast_dtypes(result, "cycles_interpolated")


SIZEOF: structured_data (): 20.790389 MB
Filename: /Users/ardunn/alex/tri/code/beep/beep/structure/base.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   695    546.7 MiB    546.7 MiB           1       @profile
   696                                             # equivalent of legacy get_summary
   697                                             def summarize_cycles(
   698                                                     self,
   699                                                     diagnostic_available=False,
   700                                                     nominal_capacity=1.1,
   701                                                     full_fast_charge=0.8,
   702                                                     cycle_complete_discharge_ratio=0.97,
   703                                                     cycle_complete_vmin=3.3,
   704                                                     cycle_complete_vmax=3.3,
   705                                                     error_threshold=1e6
   706                                             ):
   707                                                 """
   708                                                 Gets summary statistics for data according to cycle number. Summary data
   709                                                 must be float or int type for compatibility with other methods
   710                                         
   711                                                 Args:
   712                                                     diagnostic_available (dict): dictionary with diagnostic_types
   713                                                     nominal_capacity (float): nominal capacity for summary stats
   714                                                     full_fast_charge (float): full fast charge for summary stats
   715                                                     cycle_complete_discharge_ratio (float): expected ratio
   716                                                         discharge/charge at the end of any complete cycle
   717                                                     cycle_complete_vmin (float): expected voltage minimum achieved
   718                                                         in any complete cycle
   719                                                     cycle_complete_vmax (float): expected voltage maximum achieved
   720                                                         in any complete cycle
   721                                                     error_threshold (float): threshold to consider the summary value
   722                                                         an error (applied only to specific columns that should reset
   723                                                         each cycle)
   724                                         
   725                                                 Returns:
   726                                                     (pandas.DataFrame): summary statistics by cycle.
   727                                         
   728                                                 """
   729                                                 # Filter out only regular cycles for summary stats. Diagnostic summary computed separately
   730    546.7 MiB      0.0 MiB           1           if diagnostic_available:
   731    546.7 MiB      0.0 MiB           1               diag_cycles = list(
   732    546.7 MiB      0.0 MiB           1                   itertools.chain.from_iterable(
   733                                                             [
   734    546.7 MiB      0.0 MiB           7                           list(range(i, i + diagnostic_available["length"]))
   735    546.7 MiB      0.0 MiB           5                           for i in diagnostic_available["diagnostic_starts_at"]
   736    546.7 MiB      0.0 MiB           4                           if i <= self.raw_data.cycle_index.max()
   737                                                             ]
   738                                                         )
   739                                                     )
   740                                                     reg_cycles_at = [
   741    546.7 MiB      0.0 MiB         251                   i for i in self.raw_data.cycle_index.unique() if
   742    546.7 MiB      0.0 MiB         248                   i not in diag_cycles
   743                                                     ]
   744                                                 else:
   745                                                     reg_cycles_at = [i for i in self.raw_data.cycle_index.unique()]
   746                                         
   747    551.3 MiB      4.7 MiB           1           summary = self.raw_data.groupby("cycle_index").agg(self._aggregation)
   748                                         
   749                                                 # pd.set_option('display.max_rows', 500)
   750                                                 # pd.set_option('display.max_columns', 500)
   751                                                 # pd.set_option('display.width', 1000)
   752                                         
   753    551.3 MiB      0.0 MiB           1           summary.columns = self._summary_cols
   754                                         
   755    551.3 MiB      0.0 MiB           1           summary = summary[summary.index.isin(reg_cycles_at)]
   756                                                 summary["energy_efficiency"] = (
   757    551.3 MiB      0.0 MiB           1                   summary["discharge_energy"] / summary["charge_energy"]
   758                                                 )
   759                                                 summary.loc[
   760                                                     ~np.isfinite(summary["energy_efficiency"]), "energy_efficiency"
   761    551.3 MiB      0.0 MiB           1           ] = np.NaN
   762                                                 # This code is designed to remove erroneous energy values
   763    551.3 MiB      0.0 MiB           3           for col in ["discharge_energy", "charge_energy"]:
   764    551.3 MiB      0.0 MiB           2               summary.loc[summary[col].abs() > error_threshold, col] = np.NaN
   765    551.3 MiB      0.0 MiB           1           summary["charge_throughput"] = summary.charge_capacity.cumsum()
   766    551.3 MiB      0.0 MiB           1           summary["energy_throughput"] = summary.charge_energy.cumsum()
   767                                         
   768                                                 # This method for computing charge start and end times implicitly
   769                                                 # assumes that a cycle starts with a charge step and is then followed
   770                                                 # by discharge step.
   771                                                 charge_start_time = \
   772    551.3 MiB      0.0 MiB           1               self.raw_data.groupby("cycle_index", as_index=False)[
   773    551.3 MiB      0.0 MiB           1                   "date_time_iso"
   774    552.9 MiB      1.5 MiB           1               ].agg("first")
   775                                         
   776                                                 charge_finish_time = (
   777    552.9 MiB      0.0 MiB           1               self.raw_data[
   778    625.1 MiB     72.2 MiB           1                   self.raw_data.charge_capacity >= nominal_capacity * full_fast_charge]
   779    625.1 MiB      0.0 MiB           1               .groupby("cycle_index", as_index=False)["date_time_iso"]
   780    583.2 MiB    -41.9 MiB           1               .agg("first")
   781                                                 )
   782                                         
   783                                                 # Left merge, since some cells might not reach desired levels of
   784                                                 # charge_capacity and will have NaN for charge duration
   785    583.2 MiB      0.0 MiB           1           merged = charge_start_time.merge(
   786    583.3 MiB      0.1 MiB           1               charge_finish_time, on="cycle_index", how="left"
   787                                                 )
   788                                         
   789                                                 # Charge duration stored in seconds - note that date_time_iso is only ~1sec resolution
   790    583.3 MiB      0.0 MiB           1           time_diff = np.subtract(
   791    583.3 MiB      0.0 MiB           1               pd.to_datetime(merged.date_time_iso_y, utc=True, errors="coerce"),
   792    583.3 MiB      0.0 MiB           1               pd.to_datetime(merged.date_time_iso_x, errors="coerce"),
   793                                                 )
   794    583.3 MiB      0.0 MiB           1           summary["charge_duration"] = np.round(
   795    583.4 MiB      0.0 MiB           1               time_diff / np.timedelta64(1, "s"), 2)
   796                                         
   797                                                 # Compute time since start of cycle in minutes. This comes handy
   798                                                 # for featurizing time-temperature integral
   799    583.4 MiB      0.0 MiB           1           self.raw_data["time_since_cycle_start"] = pd.to_datetime(
   800    583.4 MiB      0.0 MiB           1               self.raw_data["date_time_iso"]
   801    583.4 MiB      0.0 MiB           1           ) - pd.to_datetime(
   802    583.4 MiB      0.0 MiB           1               self.raw_data.groupby("cycle_index")["date_time_iso"].transform(
   803    529.2 MiB    -54.2 MiB           1                   "first")
   804                                                 )
   805    529.2 MiB      0.0 MiB           1           self.raw_data["time_since_cycle_start"] = (self.raw_data[
   806    529.2 MiB      0.0 MiB           1                                                          "time_since_cycle_start"] / np.timedelta64(
   807    528.8 MiB     -0.4 MiB           1               1, "s")) / 60
   808                                         
   809                                                 # Group by cycle index and integrate time-temperature
   810                                                 # using a lambda function.
   811    528.8 MiB      0.0 MiB           1           if "temperature" in self.raw_data.columns:
   812    528.8 MiB      0.0 MiB           1               summary["time_temperature_integrated"] = self.raw_data.groupby(
   813    528.8 MiB      0.0 MiB           1                   "cycle_index").apply(
   814    761.7 MiB    136.2 MiB         497                   lambda g: integrate.trapz(g.temperature, x=g.time_since_cycle_start)
   815                                                     )
   816                                         
   817                                                 # Drop the time since cycle start column
   818    647.2 MiB   -114.6 MiB           1           self.raw_data.drop(columns=["time_since_cycle_start"])
   819                                         
   820                                                 # Determine if any of the cycles has been paused
   821    647.2 MiB      0.0 MiB           1           summary["paused"] = self.raw_data.groupby("cycle_index").apply(
   822    490.8 MiB   -156.4 MiB           1               get_max_paused_over_threshold)
   823                                         
   824    490.8 MiB      0.0 MiB           1           summary = self._cast_dtypes(summary, "summary")
   825                                         
   826    490.8 MiB      0.0 MiB           1           last_voltage = self.raw_data.loc[
   827    490.8 MiB      0.0 MiB           1               self.raw_data["cycle_index"] == self.raw_data["cycle_index"].max()
   828    490.8 MiB      0.0 MiB           1               ]["voltage"]
   829                                                 if (
   830    490.8 MiB      0.0 MiB           1                   (last_voltage.min() < cycle_complete_vmin)
   831    490.8 MiB      0.0 MiB           1                   and (last_voltage.max() > cycle_complete_vmax)
   832                                                         and (
   833                                                         (summary.iloc[[-1]])["discharge_capacity"].iloc[0]
   834                                                         > cycle_complete_discharge_ratio
   835                                                         * (summary.iloc[[-1]])["charge_capacity"].iloc[0]
   836                                                             )
   837                                                 ):
   838                                                     return summary
   839                                                 else:
   840    490.8 MiB      0.0 MiB           1               return summary.iloc[:-1]


SIZEOF: structured_summary (): 0.039124 MB
Filename: /Users/ardunn/alex/tri/code/beep/beep/structure/base.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   123    299.1 MiB    299.1 MiB           1               def wrapper(*args, **kwargs):
   124    299.1 MiB      0.0 MiB           1                   if args[0]._is_legacy:
   125                                                             raise ValueError(
   126                                                                 f"{args[0].__class__.__name__} is deserialized from a legacy file! Operation not allowed."
   127                                                             )
   128                                                         else:
   129    490.8 MiB    191.6 MiB           1                       return func(*args, **kwargs)
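
One detail in the summarize_cycles listing above: line 818 calls `self.raw_data.drop(columns=["time_since_cycle_start"])` without assigning the result. `DataFrame.drop` returns a new frame by default, so the temporary column most likely stays on raw_data after that line. A minimal sketch with a toy frame (not beep data) showing the difference:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cycle_index": [1, 1, 2],
        "time_since_cycle_start": [0.0, 1.0, 0.0],
    }
)

# drop() returns a new frame by default; the result is discarded here,
# so the original frame still carries the column.
df.drop(columns=["time_since_cycle_start"])
still_present = "time_since_cycle_start" in df.columns

# Assigning the result (or passing inplace=True) actually removes it.
df = df.drop(columns=["time_since_cycle_start"])
removed = "time_since_cycle_start" not in df.columns

print(still_present, removed)
```

How much memory the leftover column costs in practice is not clear from the profile; the negative increment reported on that line may just reflect temporaries from the preceding groupby being released.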

Source code for running:

from beep.structure.maccor import MaccorDatapath
import pandas as pd
import os
from beep.tests.constants import TEST_FILE_DIR

os.environ["BEEP_PROCESSING_DIR"] = TEST_FILE_DIR

maccor_file_w_parameters = os.path.join(
    TEST_FILE_DIR, "PreDiag_000287_000128.092"
)

md = MaccorDatapath.from_file(maccor_file_w_parameters)


print("MEM: GETTING STRUCTURING PARAMETERS")
(
    v_range,
    resolution,
    nominal_capacity,
    full_fast_charge,
    diagnostic_available,
) = md.determine_structuring_parameters()

print("MEM: Structuring with parameters")
structured_data = md.structure(
    v_range=v_range,
    resolution=resolution,
    nominal_capacity=nominal_capacity,
    full_fast_charge=full_fast_charge,
    diagnostic_available=diagnostic_available,
)
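
Since memory_profiler readings can be noisy around loops, one way to cross-check a suspect call is the standard library's `tracemalloc`, which reports the peak of allocations it traces during a call. The sketch below is an assumption about how one might do that, not what was used for the output above. `peak_memory_mib` and `build_list` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator hooks, so its numbers will not match process-level MiB figures exactly.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB).

    Hypothetical helper for cross-checking a single call.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


def build_list(n):
    """Stand-in workload; replace with the call under investigation."""
    return [float(i) for i in range(n)]


if __name__ == "__main__":
    data, peak = peak_memory_mib(build_list, 1_000_000)
    print(f"peak traced memory: {peak:.1f} MiB")
```

With the script above, the same wrapper could in principle be applied to a call such as `md.structure(...)`, though that has not been tried here.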
