wiki:DevelopmentActivities/Branches/ORCHIDEE-MICT-IMBALANCE-P/SimulationTimes

Version 63 (modified by ajornet, 8 years ago)


Performance

Basic Performance Report

Overview

This document investigates the computing time of Orchidee MICT. The latest version, 6.5, is slow: a 1-year simulation at 0.5 degrees takes around 8h. The aim is to understand why this happens, so that once the issues are identified, suitable solutions can be applied.

To that end, the code is profiled with several tools (vtune, vampir, gprof, ...), which provide an easy way to identify the main hotspots in the code.

Report

attachment:performance_mict_albert_jornet_150616.pdf

MICT V6 (3344 + PFT interpolation) Module computing time

The tests start from a basic configuration, and each test activates one additional module, so the number of active modules grows from test to test. The purpose is to show the impact of each module when it is used.

Perf_MICT_options

Trunk vs MICT Comparison 11/04/2016

  • Date 11/04/2016
  • ADA Machine
  • IOIPSL production mode
  • Orchidee production mode
  • 1Y
  • 16 cores
  • Forcing:
    • 1 Degree
    • 3H

Considerations:

  • MICT is at the same level of modifications as trunk revision 3346
  • MICT uses parallel interpolation in the aggregate_2D subroutine

Overview

Orchidee vs trunk profiling

Subroutines are placed in 4 different groups described below:

  • ioipsl: all subroutines related to IOIPSL library
  • Top orchidee: subroutines >1% of computing time
  • Interpolation: interpolation time by aggregate_2D subroutine
  • other orchidee: remaining subroutines from orchidee
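The grouping above can be sketched as a small script over a gprof flat profile. This is only an illustration, not the script used for the report: the prefixes chosen for IOIPSL (`mathelp_mp_`, `histcom_mp_`) are an assumption based on the routine names in the profiles on this page, and the helper names `classify` and `group_profile` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: routines from these modules are counted as IOIPSL.
IOIPSL_PREFIXES = ("mathelp_mp_", "histcom_mp_")
INTERPOLATION_NAME = "interpol_help_mp_aggregate_2d_"

# One flat-profile row:
# %time, cumulative s, self s, calls, self/call, total/call, name
ROW = re.compile(
    r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+(\S+)\s*$"
)

def classify(name, percent, threshold=1.0):
    """Return one of the four groups for a routine."""
    if name.startswith(IOIPSL_PREFIXES):
        return "ioipsl"
    if name == INTERPOLATION_NAME:
        return "interpolation"
    return "top orchidee" if percent > threshold else "other orchidee"

def group_profile(text):
    """Sum the %time column per group; non-data lines are skipped."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        percent, name = float(match.group(1)), match.group(7)
        totals[classify(name, percent)] += percent
    return dict(totals)

# A few rows copied from the Mict R3359 profile on this page.
SAMPLE = """\
 25.66   1383.92  1383.92  2245127     0.00     0.00  mathelp_mp_ma_fuscat_r21_
  9.62   1902.84   518.92  3835809     0.00     0.00  mathelp_mp_moycum_index_
  9.18   2398.02   495.18  3835826     0.00     0.00  histcom_mp_histwrite_real_
  5.96   2719.41   321.39    17524     0.00     0.00  thermosoil_mp_thermosoil_cond_pft_
  1.57   4507.82    84.43       55     0.00     0.00  interpol_help_mp_aggregate_2d_
  0.77   4871.20    41.35    17522     0.00     0.00  thermosoil_mp_thermosoil_readjust_
"""

groups = group_profile(SAMPLE)
for group, percent in sorted(groups.items()):
    print(f"{group:15s} {percent:6.2f}%")
```

Feeding the full flat profile instead of `SAMPLE` gives the per-group shares for a whole run. Routines that gprof omits from the listing are not counted.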

Mict R3359 (gprof)

This profile was obtained with the gprof tool:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ks/call  Ks/call  name    
 25.66   1383.92  1383.92  2245127     0.00     0.00  mathelp_mp_ma_fuscat_r21_
  9.62   1902.84   518.92  3835809     0.00     0.00  mathelp_mp_moycum_index_
  9.18   2398.02   495.18  3835826     0.00     0.00  histcom_mp_histwrite_real_
  5.96   2719.41   321.39    17524     0.00     0.00  thermosoil_mp_thermosoil_cond_pft_
  3.81   2924.90   205.49    17520     0.00     0.00  hydrol_mp_hydrol_soil_
  3.62   3119.87   194.97   420480     0.00     0.00  hydrol_mp_hydrol_soil_coef_
  3.59   3313.39   193.52    17524     0.00     0.00  thermosoil_mp_thermosoil_getdiff_
  3.11   3481.04   167.65      365     0.00     0.00  stomate_wet_ch4_pt_ter_wet2_mp_ch4_wet_flux_density_wet2_
  3.05   3645.33   164.29      365     0.00     0.00  stomate_wet_ch4_pt_ter_wet1_mp_ch4_wet_flux_density_wet1_
  2.92   3803.03   157.70      365     0.00     0.00  stomate_wet_ch4_pt_ter_wet3_mp_ch4_wet_flux_density_wet3_
  2.86   3957.34   154.31      365     0.00     0.00  stomate_wet_ch4_pt_ter_0_mp_ch4_wet_flux_density_0_
  2.74   4105.24   147.90      365     0.00     0.00  stomate_wet_ch4_pt_ter_wet4_mp_ch4_wet_flux_density_wet4_
  2.67   4249.50   144.26    17522     0.00     0.00  thermosoil_mp_thermosoil_coef_
  1.63   4337.37    87.87    17520     0.00     0.00  hydrol_mp_hydrol_diag_soil_
  1.59   4423.39    86.02  2666157     0.00     0.00  mod_orchidee_omp_transfert_mp_gather_omp_r1_
  1.57   4507.82    84.43       55     0.00     0.00  interpol_help_mp_aggregate_2d_
  1.37   4581.90    74.08    17520     0.00     0.00  diffuco_mp_diffuco_trans_co2_
  1.36   4655.06    73.16    17520     0.00     0.00  stomate_mp_stomate_main_
  1.22   4720.59    65.53    17520     0.00     0.00  stomate_permafrost_soilcarbon_mp_microactem_
  1.06   4777.86    57.27    17520     0.00     0.00  hydrol_mp_hydrol_main_
  0.96   4829.85    51.99  1602027     0.00     0.00  mathelp_mp_ma_fuscat_r11_
  0.77   4871.20    41.35    17522     0.00     0.00  thermosoil_mp_thermosoil_readjust_
  0.74   4911.35    40.15  2664512     0.00     0.00  mod_orchidee_omp_transfert_mp_gather_omp_i1_

Total Simulation time: 5358 seconds

IO: mathelp + histcom = 25.66 + 9.62 + 9.18 = ~45%

Trunk R3346

This profile was obtained with the gprof tool:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ks/call  Ks/call  name    
 22.26    441.54   441.54        7     0.06     0.06  interpol_help_mp_aggregate_2d_
 14.52    729.66   288.12  2171415     0.00     0.00  histcom_mp_histwrite_real_
 13.26    992.66   263.00    17520     0.00     0.00  hydrol_mp_hydrol_soil_
 10.28   1196.56   203.90   773813     0.00     0.00  mathelp_mp_ma_fuscat_r21_
  5.07   1297.17   100.61  2171397     0.00     0.00  mathelp_mp_moycum_index_
  4.16   1379.77    82.60    17520     0.00     0.00  diffuco_mp_diffuco_trans_co2_
  3.81   1455.34    75.57    17520     0.00     0.00  hydrol_mp_hydrol_diag_soil_
  3.67   1528.21    72.87   157680     0.00     0.00  hydrol_mp_hydrol_soil_coef_
  2.29   1573.69    45.48  1400412     0.00     0.00  mathelp_mp_ma_fuscat_r11_
  2.27   1618.76    45.07    17520     0.00     0.00  hydrol_mp_hydrol_main_
  1.86   1655.66    36.90    17521     0.00     0.00  thermosoil_mp_thermosoil_getdiff_
  1.46   1684.63    28.97    17521     0.00     0.00  thermosoil_mp_thermosoil_humlev_
  0.99   1704.17    19.54   157680     0.00     0.00  hydrol_mp_hydrol_soil_tridiag_
  0.94   1722.82    18.65    17520     0.00     0.00  stomate_litter_mp_littercalc_
  0.92   1740.99    18.17    17520     0.00     0.00  hydrol_mp_hydrol_split_soil_
  0.86   1758.10    17.11    17520     0.00     0.00  stomate_mp_stomate_main_
  0.81   1774.07    15.98  1133588     0.00     0.00  mod_orchidee_omp_transfert_mp_gather_omp_r1_

Total Simulation time: 1956 seconds

IO: mathelp + histcom = 14.52 + 10.28 + 5.07 = ~30%
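As a rough cross-check of the two reports, the IO figures can also be compared in absolute terms. The sketch below uses only numbers copied from the profiles above (total simulation time and the self seconds of the three IO routines). It assumes that gprof self seconds and the total simulation time are directly comparable, which the reports do not state.

```python
# Values copied from the Mict R3359 and Trunk R3346 reports on this page.
mict = {"total_s": 5358.0, "io_self_s": [1383.92, 518.92, 495.18]}
trunk = {"total_s": 1956.0, "io_self_s": [288.12, 203.90, 100.61]}

mict_io = sum(mict["io_self_s"])    # mathelp + histcom self seconds, MICT
trunk_io = sum(trunk["io_self_s"])  # mathelp + histcom self seconds, trunk

# How much slower MICT is overall, and how much more time it spends in IO.
slowdown = mict["total_s"] / trunk["total_s"]
io_ratio = mict_io / trunk_io

# Fraction of the extra run time that comes from the extra IO time
# (assumption: self seconds and total time are comparable).
extra_io_share = (mict_io - trunk_io) / (mict["total_s"] - trunk["total_s"])

print(f"MICT / trunk total time : {slowdown:.2f}x")
print(f"MICT / trunk IO time    : {io_ratio:.2f}x")
print(f"share of extra time from IO: {extra_io_share:.0%}")
```

Under that assumption, the IO routines grow faster than the run as a whole, so they account for a large part of the difference between the two versions.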

Trunk vs MICT Comparison 18/02/2016

18/02/2016: trunk revision 2916 and MICT revision 3161 were considered equivalent.

The same run.def file is used to compare both developments.

The simulations were carried out under the following conditions:

  • 1 Year
  • Global
  • CRU-NCEP v5.3.2 (6 hourly)
  • CURIE
  • IO library: IOIPSL/XIOS
    • Yearly output
  • Compilation mode IOIPSL: production
  • Compilation mode Orchidee: production
  • Compilation mode XIOS: production
    • e.g: 64 cores = 64 ORC + 1 XIOS

Summary

trunk_vs_mict_performance

Mict R3811 (XIOS 2)

N procs   8              16             32             64             128    256
0.5 deg   out of memory  out of memory  out of memory  out of memory  1h33   1h27
1 deg     2h48           1h27           44m24          25m21          19m52  13m47
2 deg     43m48          21m31          11m13          7m40           5m16   4m30

Output Netcdf files:

  • 0.5 Degree
Filename             Size  # vars
stomate_rest_out.nc  14G   379 (double)
sechiba_rest_out.nc  8.5G  234 (double)
driver_rest_out.nc   28M   13 (double)
sechiba_history.nc   3.0G  114 (float)
stomate_history.nc   3.3G  297 (float)

Changes

  • CROP restart variables are now only active when CROP is enabled.
  • XIOS history outputs now include 4D/5D dimensions, which reduces the number of variables in the outputs.

Issues

  • 0.5 deg: the out-of-memory failure is due to XIOS

Conclusion

  • 0.5deg, up to 64 procs: the out-of-memory failure might be due to the 4D/5D variables.
  • 0.5deg computing time: having fewer restart variables to write decreased the total time. The CROP module still has this problem.

Mict R3791 (XIOS 2)

  • Date: 30/09/16
  • Add all XIOS output fields

Time table:

N procs   8              16             32             64     128    256
0.5 deg   out of memory  out of memory  out of memory  14h56  15h42  11h06
1 deg     3h19           2h28           1h52           2h22   1h32   1h23
2 deg     54m38          38m53          23m58          24m43  24m37  16m11

Output Netcdf files:

  • 0.5 Degree
Filename             Size  # vars
stomate_rest_out.nc  20G   611 (double)
sechiba_rest_out.nc  8.6G  234 (double)
driver_rest_out.nc   28M   13 (double)
sechiba_history.nc   2.0G  388 (float)
stomate_history.nc   3.8G  1179 (float)

Changes

  • Add CROP module.

Issues

  • 0.5 deg: memory requirements are high
  • 0.5 deg: simulation time is far too high, even when the CROP module is disabled.

Conclusion

  • 0.5deg memory: the introduction of XIOS increases memory usage
  • 0.5deg simulation time: many more restart variables to write

Mict R3587 (XIOS 2!)

  • Small fixes
  • Trunk update

Time table:

N procs   8              16    32    64    128  256
0.5 deg   out of memory  4h43  2h21  1h18  58m  47m
1 deg     running        1h05  35m   23m   16m  18m
2 deg     -              -     -     -     -    -

Mict R3587 (XIOS 2 + thermosoil_cond_pft)

This branch targets the subroutine thermosoil_cond_pft, which some profiling reports show to be highly time-consuming. The following tests are an effort to improve its performance.

All tests are done at 0.5 degrees. All other parameters are the same as specified in this section.

N procs                      32    64    128   256
Avx + align 32 + vecalign32  2h08  1h16  49m   40m
Align 32 + vecalign32        2h13  1h30  54m   43m
Avx + align 32               2h12  1h18  1h02  47m
Align 16                     2h23  1h13  1h04  50m

Description:

  • avx: 256 bit register
  • align 32: -align array32byte compilation flag
  • align 16: -align array16byte compilation flag
  • vecalign: source code lines to help the compiler improve the performance

Mict R3567

  • New driver

Time table:

N procs   8        16     32    64    128   256
0.5 deg   timeout  11h22  7h21  4h51  3h38  2h49
1 deg     4h22     2h21   1h24  54m   39m   30m
2 deg     58m      31m    19m   13m   9m    7m

Mict R3527

  • PFT parallel interpolation

Time table:

N procs   8                  16     32     64    128   256
0.5 deg   >16h39 (322 days)  11h09  7h06   4h50  3h31  2h47
1 deg     4h10               2h14   1h20   52m   37m   30m
2 deg     55m05              30m01  18m05  12m   9m    7m

Mict R3161

Time table:

N procs   4        8        16     32    64    128
0.5 deg   timeout  timeout  13h00  8h46  6h35  5h38
1 deg     6h37     4h20     2h36   1h45  1h21  1h08
2 deg     1h40     56m      35m    24m   19m   16m

Note: 0.5 deg on 4 procs did not start due to memory requirements. 0.5 deg on 8 procs could not finish within the maximum time allowed by the HPC and stopped at simulation day 322. Both values can be extrapolated.

Trunk R2916

The same simulations with the same options were carried out, with the following results:

N procs   4     8     16    32    64    128
0.5 deg   8h38  5h31  3h26  2h23  1h48  1h31
1 deg     2h07  1h17  47m   32m   25m   21m
2 deg     38m   19m   11m   8m    6m    5m
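The timing tables can be turned into slowdown and parallel-efficiency figures with a short script. The sketch below uses the 1 deg rows of the Mict R3161 and Trunk R2916 tables. It assumes that bare numbers in the tables are minutes, and the helper names `to_minutes` and `efficiency` are hypothetical.

```python
import re

def to_minutes(cell):
    """Convert '8h38', '55m05' or a bare minute count such as '47' to minutes.

    Returns None for non-numeric cells such as 'timeout' or 'out of memory'.
    Assumption: a bare number means minutes.
    """
    match = re.fullmatch(r"(\d+)h(\d+)", cell)
    if match:
        return int(match.group(1)) * 60.0 + int(match.group(2))
    match = re.fullmatch(r"(\d+)m(\d+)", cell)
    if match:
        return int(match.group(1)) + int(match.group(2)) / 60.0
    if cell.isdigit():
        return float(cell)
    return None

def efficiency(procs, minutes):
    """Parallel efficiency relative to the smallest core count in the row."""
    base_p, base_t = procs[0], minutes[0]
    return [(base_t * base_p) / (t * p) for p, t in zip(procs, minutes)]

# 1 deg rows copied from the Trunk R2916 and Mict R3161 tables.
PROCS = [4, 8, 16, 32, 64, 128]
TRUNK_1DEG = ["2h07", "1h17", "47", "32", "25", "21"]
MICT_1DEG = ["6h37", "4h20", "2h36", "1h45", "1h21", "1h08"]

trunk = [to_minutes(c) for c in TRUNK_1DEG]
mict = [to_minutes(c) for c in MICT_1DEG]

slowdown = [m / t for m, t in zip(mict, trunk)]
trunk_eff = efficiency(PROCS, trunk)
mict_eff = efficiency(PROCS, mict)

for p, s, et, em in zip(PROCS, slowdown, trunk_eff, mict_eff):
    print(f"{p:4d} procs: MICT/trunk = {s:.2f}, "
          f"efficiency trunk {et:.2f}, MICT {em:.2f}")
```

Rows that contain cells such as "timeout" need those entries filtered out before calling `efficiency`, since `to_minutes` returns None for them.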

IOIPSL

Restart File Creation :
