Brown University



Results of parallel 2D FFT (1024x1024)

esp3:ce107/work/SP2-tutorial% mpirun -np 1 2dfft1x1
 2dfft achieves  23.6892973841184826  Mflop/s
 ( 10  iterations in  44.2637020000000021  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  25.8000219770237216  Mflop/s
 ( 10  iterations in  40.6424460000000067  secs)
 Using unit-stride stores for copying
 ************************************
 row_2dfft achieves  29.1567400283460358  Mflop/s
 ( 10  iterations in  35.9634169999999926  secs)
 ************************************
 "transpose" row_2dfft achieves  48.1264008311380067  Mflop/s
 ( 10  iterations in  21.7879580000000033  secs)
 ************************************
 in-place pdcft2 achieves  69.1892987583912742  Mflop/s
 ( 10  iterations in  15.1551759999999831  secs)
 ************************************
 pdcft2 achieves  69.7629223868726740  Mflop/s
 ( 10  iterations in  15.0305630000000008  secs)
 ************************************
 "transpose" pdcft2 achieves  97.0241513870488035  Mflop/s
 ( 10  iterations in  10.8073709999999892  secs)
 ************************************
 in-place dual dcft achieve  55.3922864815433584  Mflop/s
 ( 10  iterations in  18.9300003051757812  secs)
 ************************************
 The leading dimension is  1028  for  m =  1024
 dual dcft achieve  73.1734825947448542  Mflop/s
 ( 10  iterations in  14.3299999237060547  secs)
 ************************************
 in-place dcft2 achieves  55.7456687810419567  Mflop/s
 ( 10  iterations in  18.8099994659423828  secs)
 ************************************
 The leading dimension is  1028  for  m =  1024
 dcft2 achieves  88.0416468038697815  Mflop/s
 ( 10  iterations in  11.9099998474121094  secs)
 ************************************
 The leading dimension is  1028  for  n =  1024
 "transpose" dcft2 achieves  100.438318213179656  Mflop/s
 ( 10  iterations in  10.4399995803833008  secs)
 ************************************
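
The Mflop/s figures above are not explained in the log itself. They appear consistent with the conventional FFT operation count of 5 N log2(N) flops for N complex points, with N = 1024 x 1024, multiplied by the 10 iterations and divided by the elapsed time. The sketch below checks that assumption against the first entry; the function name fft2d_mflops is hypothetical and the formula is inferred from the numbers, not taken from the benchmark source.

```python
import math

def fft2d_mflops(n, iterations, seconds):
    """Mflop/s for `iterations` complex n x n 2D FFTs.

    Assumes the conventional 5*N*log2(N) flop count with N = n*n points;
    this is an inference from the logged numbers, not the benchmark's code.
    """
    points = n * n
    flops_per_fft = 5.0 * points * math.log2(points)
    return iterations * flops_per_fft / seconds / 1.0e6

# First entry of the single-process run: 10 iterations in 44.263702 secs.
rate = fft2d_mflops(1024, 10, 44.263702)
print(f"{rate:.4f} Mflop/s")  # compare with the logged 23.6892973841184826
```

Under this assumption each rate is simply 1048.576 Mflop divided by the elapsed seconds, so the rates and times in the log carry the same information.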

esp3:ce107/work/SP2-tutorial% mpirun -np 2 2dfft2x1
 2dfft achieves  33.9549647643603976  Mflop/s
 ( 10  iterations in  30.8813749999999985  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  36.4694179923623238  Mflop/s
 ( 10  iterations in  28.7522000000000020  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 2 2dfft1x2
 2dfft achieves  45.7082437833447628  Mflop/s
 ( 10  iterations in  22.9406320000000008  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  51.2392250278313597  Mflop/s
 ( 10  iterations in  20.4643220000000028  secs)
 Using unit-stride stores for copying
 ************************************
 row_2dfft achieves  52.2715384629572597  Mflop/s
 ( 10  iterations in  20.0601710000000040  secs)
 ************************************
 "transpose" row_2dfft achieves  84.5677501898905746  Mflop/s
 ( 10  iterations in  12.3992420000000081  secs)
 ************************************
 in-place pdcft2 achieves  63.5959653301261767  Mflop/s
 ( 10  iterations in  16.4880899999999997  secs)
 ************************************
 pdcft2 achieves  63.6752036062146800  Mflop/s
 ( 10  iterations in  16.4675720000000041  secs)
 ************************************
 "transpose" pdcft2 achieves  119.927017375633582  Mflop/s
 ( 10  iterations in  8.74345099999999320  secs)
 ************************************
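
For tabulating these results, the log format is regular enough to parse with two regular expressions: one for the "achieves" line and one for the "iterations in ... secs" line that follows it. The following is a minimal sketch, assuming the layout shown in this transcript; parse_results and the pattern names are hypothetical helpers, not part of the benchmark. The pattern accepts both "achieves" and "achieve", since both spellings occur in the output, and it ignores the "Using unit-stride ..." and asterisk lines.

```python
import re

# "<label> achieves <rate> Mflop/s" (the log also prints "achieve")
RATE = re.compile(r'^\s*(?P<label>.+?)\s+achieves?\s+(?P<rate>[\d.]+)\s+Mflop/s')
# "( <iters> iterations in <secs> secs)"
TIME = re.compile(r'\(\s*(?P<iters>\d+)\s+iterations in\s+(?P<secs>[\d.]+)\s+secs\)')

def parse_results(text):
    """Return a list of (label, mflops, iterations, seconds) tuples."""
    results = []
    label = rate = None
    for line in text.splitlines():
        m = RATE.match(line)
        if m:
            # Remember the rate line until its timing line arrives.
            label, rate = m.group('label'), float(m.group('rate'))
            continue
        m = TIME.search(line)
        if m and label is not None:
            results.append((label, rate, int(m.group('iters')), float(m.group('secs'))))
            label = rate = None
    return results

# Two entries copied from the 2dfft1x2 run above.
SAMPLE = """\
 row_2dfft achieves  52.2715384629572597  Mflop/s
 ( 10  iterations in  20.0601710000000040  secs)
 ************************************
 "transpose" row_2dfft achieves  84.5677501898905746  Mflop/s
 ( 10  iterations in  12.3992420000000081  secs)
 ************************************
"""

for label, mflops, iters, secs in parse_results(SAMPLE):
    print(f"{label}: {mflops:.2f} Mflop/s, {iters} iterations, {secs:.3f} s")
```

The parser does not record which "2dfft" entry used unit-stride loads and which used stores; that note follows the timing line and would need a third pattern.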

esp3:ce107/work/SP2-tutorial% mpirun -np 4 2dfft4x1
 2dfft achieves  58.4136069941060327  Mflop/s
 ( 10  iterations in  17.9508860000000006  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  61.9241500972761756  Mflop/s
 ( 10  iterations in  16.9332320000000003  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 4 2dfft2x2
 2dfft achieves  63.2847555689566121  Mflop/s
 ( 10  iterations in  16.5691720000000018  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  66.5042056342389145  Mflop/s
 ( 10  iterations in  15.7670630000000003  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 4 2dfft1x4
 2dfft achieves  79.7560258593117624  Mflop/s
 ( 10  iterations in  13.1472949999999997  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  85.1894098186982376  Mflop/s
 ( 10  iterations in  12.3087599999999995  secs)
 Using unit-stride stores for copying
 ************************************
 row_2dfft achieves  84.3598991841536900  Mflop/s
 ( 10  iterations in  12.4297919999999991  secs)
 ************************************
 "transpose" row_2dfft achieves  145.764395084766505  Mflop/s
 ( 10  iterations in  7.19363600000000503  secs)
 ************************************
 in-place pdcft2 achieves  141.969354713799078  Mflop/s
 ( 10  iterations in  7.38593200000000394  secs)
 ************************************
 pdcft2 achieves  144.872379930778521  Mflop/s
 ( 10  iterations in  7.23792900000000117  secs)
 ************************************
 "transpose" pdcft2 achieves  276.454384773410936  Mflop/s
 ( 10  iterations in  3.79294399999999854  secs)
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 8 2dfft8x1
 2dfft achieves  95.6782305600671634  Mflop/s
 ( 10  iterations in  10.9594000000000005  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  103.349381141253232  Mflop/s
 ( 10  iterations in  10.1459340000000005  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 8 2dfft4x2
 2dfft achieves  112.462095727533054  Mflop/s
 ( 10  iterations in  9.32381700000000002  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  115.498105516752418  Mflop/s
 ( 10  iterations in  9.07872899999999916  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 8 2dfft2x4
 2dfft achieves  111.974724699364060  Mflop/s
 ( 10  iterations in  9.36439900000000058  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  114.444076755904405  Mflop/s
 ( 10  iterations in  9.16234399999999916  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 8 2dfft1x8
 2dfft achieves  121.504788865243739  Mflop/s
 ( 10  iterations in  8.62991500000000045  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  128.355217423609702  Mflop/s
 ( 10  iterations in  8.16932899999999940  secs)
 Using unit-stride stores for copying
 ************************************
 row_2dfft achieves  143.933640213942482  Mflop/s
 ( 10  iterations in  7.28513500000000036  secs)
 ************************************
 "transpose" row_2dfft achieves  241.600637401583981  Mflop/s
 ( 10  iterations in  4.34012099999999990  secs)
 ************************************
 in-place pdcft2 achieves  276.854307564558553  Mflop/s
 ( 10  iterations in  3.78746499999999742  secs)
 ************************************
 pdcft2 achieves  279.769103829904225  Mflop/s
 ( 10  iterations in  3.74800499999999914  secs)
 ************************************
 "transpose" pdcft2 achieves  517.565854679423637  Mflop/s
 ( 10  iterations in  2.02597600000000000  secs)
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 16 2dfft16x1
 2dfft achieves  169.356879743688012  Mflop/s
 ( 10  iterations in  6.19151700000000016  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  174.396550175664544  Mflop/s
 ( 10  iterations in  6.01259600000000027  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 16 2dfft8x2
 2dfft achieves  187.965016124203885  Mflop/s
 ( 10  iterations in  5.57857000000000003  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  196.006908065123071  Mflop/s
 ( 10  iterations in  5.34968899999999969  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 16 2dfft4x4
 2dfft achieves  196.886840988550063  Mflop/s
 ( 10  iterations in  5.32577999999999996  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  209.046960486110095  Mflop/s
 ( 10  iterations in  5.01598300000000119  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 16 2dfft2x8
 2dfft achieves  184.055248029163067  Mflop/s
 ( 10  iterations in  5.69707200000000036  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  192.622283781773035  Mflop/s
 ( 10  iterations in  5.44369000000000014  secs)
 Using unit-stride stores for copying
 ************************************

esp3:ce107/work/SP2-tutorial% mpirun -np 16 2dfft1x16
 2dfft achieves  220.817320762925959  Mflop/s
 ( 10  iterations in  4.74861299999999975  secs)
 Using unit-stride loads for copying
 ************************************
 2dfft achieves  222.642490952154958  Mflop/s
 ( 10  iterations in  4.70968500000000034  secs)
 Using unit-stride stores for copying
 ************************************
 row_2dfft achieves  237.790345119108736  Mflop/s
 ( 10  iterations in  4.40966600000000142  secs)
 ************************************
 "transpose" row_2dfft achieves  414.412036008748771  Mflop/s
 ( 10  iterations in  2.53027399999999858  secs)
 ************************************
 in-place pdcft2 achieves  553.576632470763457  Mflop/s
 ( 10  iterations in  1.89418399999999920  secs)
 ************************************
 pdcft2 achieves  557.139153018587763  Mflop/s
 ( 10  iterations in  1.88207200000000086  secs)
 ************************************
 "transpose" pdcft2 achieves  911.919405560348537  Mflop/s
 ( 10  iterations in  1.14985599999999977  secs)
 ************************************
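
The elapsed times above can be turned into speedup and parallel efficiency figures. The sketch below does this for the "transpose" pdcft2 entries of the 1x1, 1x2, 1x4, 1x8 and 1x16 runs, using the single-process time as the baseline. The times are copied from the log and rounded to six decimals; the function name scaling is hypothetical, and speedup is taken as T(1)/T(p) with efficiency as speedup/p.

```python
# Seconds for 10 iterations of "transpose" pdcft2, keyed by process count,
# copied from the 2dfft1x1, 1x2, 1x4, 1x8 and 1x16 runs in this log.
timings = {1: 10.807371, 2: 8.743451, 4: 3.792944, 8: 2.025976, 16: 1.149856}

def scaling(timings):
    """Map process count -> (speedup, efficiency) relative to one process."""
    base = timings[1]
    return {p: (base / t, base / t / p) for p, t in sorted(timings.items())}

for p, (speedup, efficiency) in scaling(timings).items():
    print(f"{p:2d} processes: speedup {speedup:5.2f}, efficiency {efficiency:5.2f}")
```

With these numbers the 16-process run is roughly 9.4 times faster than the single-process run. The same function can be applied to any other routine in the log by substituting its times.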