|Posted on September 24, 2010 at 11:04 AM|
As I mentioned a few months ago, we want to produce benchmark calculations for few-electron harmonium. The idea is to use this information in the calibration of quantum chemical methods, and in particular we have in mind the design of a new functional using these data. To this end, we developed a method that produces highly accurate results for few-electron harmonium. The method uses an extrapolation scheme based on a few FCI calculations. Since these calculations are prohibitively expensive for more than three electrons, we are developing an MPI version of the code. This post is about that possibility and a few benchmark results we have obtained so far.
In order for the extrapolation scheme to work we must use a particular basis set. There are many basis sets suitable for extrapolation schemes, but they are optimized for molecular systems and are not so good for harmonium. Our basis set consists of even-tempered Gaussian functions, the exponents of which (alpha, beta) are optimized for each calculation using the simplex method. The simplex method shrinks and shifts a grid of nine points, so at each step there are a few calculations that can run simultaneously. We always start with nine calculations; every time we shrink the grid we need eight additional calculations, and if we shift, between three and five calculations are needed for the next step. Therefore, in this first step we may use between 3 and 9 processors. This part is trivial to parallelize (it's an embarrassingly parallel problem).
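To make the "embarrassingly parallel" point concrete, here is a minimal Python sketch, not our actual code. It assumes the usual even-tempered form alpha * beta**k for the exponents, and `toy_energy` is a hypothetical stand-in for one full FCI calculation; the grid helper and the thread pool are likewise only illustrative (a real run would use separate processes or MPI ranks).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def even_tempered_exponents(alpha, beta, n):
    """Exponents alpha * beta**k for k = 0..n-1 (assumed even-tempered form)."""
    return [alpha * beta**k for k in range(n)]


def toy_energy(point):
    """Hypothetical stand-in for one FCI energy evaluation at (alpha, beta)."""
    alpha, beta = point
    return (alpha - 0.3) ** 2 + (beta - 2.5) ** 2


def grid_3x3(center, step):
    """Nine (alpha, beta) points: the center and its eight neighbours."""
    a0, b0 = center
    return [(a0 + i * step, b0 + j * step)
            for i, j in product((-1, 0, 1), repeat=2)]


def evaluate_grid(points, energy=toy_energy, workers=9):
    """Evaluate all points concurrently; no evaluation depends on another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(energy, points))


points = grid_3x3(center=(0.5, 2.0), step=0.1)
energies = evaluate_grid(points)
best = points[energies.index(min(energies))]  # grid point with lowest energy
```

Because each grid point is an independent calculation, the only communication needed is collecting nine numbers at the end of the step.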
To perform the FCI calculation we use a modified version of Peter Knowles' code (CPL 1989). This code uses D2h symmetry to perform Davidson iterations over an FCI matrix constructed from Slater determinants. Since the program uses symmetry, it can be parallelized over the different irreducible representations of the D2h point group. Namely, the calculation of the sigma vector, which involves a double loop over the eight irreducible representations, can be parallelized. We may thus use up to 64 processors in this part of the program.
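The count of 64 comes simply from the double loop: 8 x 8 pairs of irreducible representations. A small Python sketch of how such pairs could be handed out to workers follows. The round-robin assignment is an assumption made for illustration; it ignores the fact that the blocks may differ in size, which a real load-balancing scheme would have to consider.

```python
from itertools import product

N_IRREPS = 8  # D2h has eight irreducible representations


def irrep_pair_tasks():
    """All (i, j) pairs from the double loop over irreps: 8 * 8 = 64 tasks."""
    return list(product(range(N_IRREPS), repeat=2))


def tasks_for_rank(rank, n_ranks):
    """Round-robin share of the pair tasks for one worker (hypothetical scheme)."""
    return irrep_pair_tasks()[rank::n_ranks]


# With 64 workers each one gets exactly one pair; with fewer, several pairs.
one_each = [tasks_for_rank(r, 64) for r in range(64)]
shared = [tasks_for_rank(r, 10) for r in range(10)]
```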
2) The Sigma vector.
Each of these 64 tasks running in parallel contains one main procedure that takes most of the time: its contribution to the sigma vector, which involves a large matrix multiplication. We can also split this last task among different processors (as simply as letting one processor handle one column at a time).
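The column-by-column split works because each column of a matrix product can be computed without knowing any other column. A toy pure-Python illustration (our code is an MPI program, not Python; the function names and the thread pool here are only assumptions for the sketch):

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_column(a, b, j):
    """Column j of the product a @ b, computed independently of other columns."""
    return [sum(a[i][k] * b[k][j] for k in range(len(b)))
            for i in range(len(a))]


def matmul_by_columns(a, b, workers=2):
    """Matrix product where each worker handles one column at a time."""
    n_cols = len(b[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cols = list(pool.map(lambda j: matmul_column(a, b, j), range(n_cols)))
    # cols[j][i] holds element (i, j); transpose back to row-major order
    return [[cols[j][i] for j in range(n_cols)] for i in range(len(a))]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = matmul_by_columns(a, b)  # [[19, 22], [43, 50]]
```

In an MPI setting the same idea means each rank needs the whole left matrix but only its own columns of the right one, and the results are gathered at the end.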
Therefore, altogether we can expect our code to parallelize very well over 192N to 576N processors (3 to 9 simultaneous optimization points, times 64 pairs of irreducible representations, times N). The factor N is the number of processors among which we split the matrix multiplication. We have applied, for the third time, for computing time at the Barcelona Supercomputing Center (BSC), where the maximum number of processors we can use in one parallel job is 1024. This means we should show that the MPI subroutine for matrix multiplication parallelizes well for 2-5 processors. Let me give you some parallelization numbers (these are lower bounds on the actual performance, because several processes were running on the same workstation):
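For the record, the arithmetic behind these figures can be checked in a few lines of Python (the function names are just for illustration): the totals are 3 x 64 x N and 9 x 64 x N, and the 2-5 range for N is the 1024-processor limit divided by 576 and by 192, roughly 1.8 and 5.3.

```python
IRREP_TASKS = 64   # 8 x 8 pairs of irreducible representations
MAX_PROCS = 1024   # largest single parallel job mentioned in the text


def total_processors(opt_tasks, n):
    """Processors used with opt_tasks simultaneous optimization points."""
    return opt_tasks * IRREP_TASKS * n


def n_limit(opt_tasks, max_procs=MAX_PROCS):
    """Upper bound on N (as a real number) for a given number of opt tasks."""
    return max_procs / (opt_tasks * IRREP_TASKS)


low = total_processors(3, 1)    # 192 per unit of N
high = total_processors(9, 1)   # 576 per unit of N
n_range = (n_limit(9), n_limit(3))  # about 1.8 and 5.3
```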
| | 1 proc | 3 proc* | 5 proc* |
|---|---|---|---|
| total time (s) | 26,689 | 10,077 (2.6) | 6,737 (4.0) |
| largest multiplication (s) | 1,996.85 | 670.97 (3.0) | 403.32 (5.0) |

* The numbers in brackets are the speedups relative to one processor, i.e. the effective number of processors used (ideally they should equal the number of processors requested).
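The bracketed figures can be reproduced from the timings as the ratio of the one-processor time to the parallel time. A short Python check, which also gives the parallel efficiency (speedup divided by the number of processors requested); the function and variable names are only illustrative:

```python
def speedup(t_serial, t_parallel):
    """Ratio of one-processor time to parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Speedup divided by the number of processors requested."""
    return speedup(t_serial, t_parallel) / n_procs


# Timings in seconds, copied from the table above.
timings = {
    "total time": {1: 26689.0, 3: 10077.0, 5: 6737.0},
    "largest multiplication": {1: 1996.85, 3: 670.97, 5: 403.32},
}

for label, t in timings.items():
    for p in (3, 5):
        s = speedup(t[1], t[p])
        e = efficiency(t[1], t[p], p)
        print(f"{label}, {p} proc: speedup {s:.1f}, efficiency {e:.2f}")
```

The multiplication itself scales almost ideally (efficiency about 0.99), while the total time comes out at roughly 0.88 and 0.79 for 3 and 5 processors.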
These numbers suggest that an efficient parallelization over up to a thousand processors is possible with our modified FCI code. So far I have implemented the matrix multiplication in parallel, and I'm now working on the parallelization over the different irreducible representations (I expect to have it ready in less than two weeks). The parallelization of the optimization process is quite trivial and can be done in a few hours.