In this project we propose to develop novel Monte Carlo-based algorithms and techniques, and to demonstrate that, in many crucial scientific areas, they are best positioned to reap the benefits of the computational capabilities of the next generation of machines. To this end, we have organized this proposal into five tasks, and believe we can make fundamental contributions in each:

- Task 1 focuses on developing a highly optimized, fully distributed implementation of a library of matrix functions computed using a Monte Carlo method, drawing on the PI’s expertise in large-scale computing and incorporating the techniques developed in Task 2;
- Task 2 addresses one major drawback of Monte Carlo methods, namely their low rate of convergence, building on previous contributions and new ideas by the co-PI towards variance reduction techniques;
- Task 3 applies the library developed in Task 1 to the analysis of complex networks;
- Task 4 addresses the direct solution of large dynamic systems using the matrix exponential;
- Task 5 again uses the tool from Task 1 for the more exploratory analysis of problems described by fractional-order differential equations.

The work to be carried out in Task 2 will have two main thrusts:

- Development of new variance reduction techniques, extending previous work on multilevel and importance sampling methods;
- The analysis of the suitability of quasi-random (low-discrepancy) sequences as a replacement for pseudorandom sampling.

An interesting feature of the multilevel method is that it automatically determines the optimal number of levels, and of samples per level, to achieve a desired maximum error. We have been able to adapt the multilevel method to the problem of computing the action of a matrix exponential on a vector [1,2] and intend to extend this approach to other matrix functions. This work will require both theoretical research and empirical experimentation. The results will be directly incorporated into the distributed matrix-function implementation to be developed under Task 1. Still, we believe these results have much wider applicability and will be widely disseminated.
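
To illustrate the automatic sample allocation, the sketch below applies the standard multilevel Monte Carlo recipe to a toy stochastic differential equation (not the matrix-function setting of this proposal): a pilot run estimates the variance at each level, and the number of samples per level is then chosen proportional to sqrt(V_l / C_l). All function names and parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def level_sample(l, n, M=2, T=1.0, S0=1.0, r=0.05, sig=0.2):
    """Samples of P_l - P_{l-1}: Euler paths of dS = r*S dt + sig*S dW,
    with fine and coarse levels coupled on the same Brownian increments."""
    nf = M ** l
    dW = rng.normal(0.0, np.sqrt(T / nf), size=(n, nf))
    Sf = np.full(n, S0)
    for i in range(nf):
        Sf = Sf * (1 + r * T / nf + sig * dW[:, i])
    if l == 0:
        return Sf
    Sc = np.full(n, S0)
    dWc = dW.reshape(n, nf // M, M).sum(axis=2)  # coarse increments
    for i in range(nf // M):
        Sc = Sc * (1 + r * T * M / nf + sig * dWc[:, i])
    return Sf - Sc

def mlmc(eps, L=5, n_pilot=2000):
    """Multilevel estimate of E[S_T]: a pilot run per level estimates the
    variance V_l; N_l is then chosen proportional to sqrt(V_l / C_l) so
    the total sampling error is about eps."""
    V = [level_sample(l, n_pilot).var() for l in range(L + 1)]
    C = [2.0 ** l for l in range(L + 1)]        # cost model: steps per path
    K = sum(np.sqrt(v * c) for v, c in zip(V, C))
    est = 0.0
    for l in range(L + 1):
        N = max(100, int(np.ceil(2 / eps**2 * np.sqrt(V[l] / C[l]) * K)))
        est += level_sample(l, N).mean()
    return est
```

Note how the telescoping sum concentrates most samples on the cheap coarse levels, which is what makes the method competitive with plain Monte Carlo at the same accuracy.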

The method for computing the different matrix functions to be developed in Task 1 is based on a sum of powers of the matrix, with the weight of each power defined by the function's Neumann series [7]. We will use a Monte Carlo approach in which the k-th power of the matrix is estimated using random walks of length k over the matrix [3]. An additional, potentially very relevant, contribution will be the extension of this approach to the solution of fractional partial differential equations.
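
For illustration, a minimal (serial, unoptimized) sketch of this class of estimator in the Ulam–von Neumann style: each walk starts at the row of interest, moves according to transition probabilities proportional to the magnitudes of the matrix entries, and carries an importance weight so that the length-k prefix of the walk is an unbiased estimate of the corresponding entry of A^k v. The function name, the choice of transition probabilities, and the test matrix are illustrative assumptions, not the implementation proposed in Task 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_matfun_action(A, v, coeffs, i, n_walks=10000):
    """Estimate (sum_k coeffs[k] * A^k v)[i] with random walks of
    length len(coeffs) - 1."""
    n = A.shape[0]
    absA = np.abs(A)
    P = absA / absA.sum(axis=1, keepdims=True)   # transition probabilities
    K = len(coeffs) - 1
    total = 0.0
    for _ in range(n_walks):
        state, w = i, 1.0
        est = coeffs[0] * v[i]                   # k = 0 term (identity)
        for k in range(1, K + 1):
            nxt = rng.choice(n, p=P[state])
            w *= A[state, nxt] / P[state, nxt]   # importance weight
            state = nxt
            est += coeffs[k] * w * v[state]
        total += est
    return total / n_walks
```

With coeffs[k] = 1/k!, for instance, this targets a truncated matrix exponential; other matrix functions correspond to other weight sequences.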

Although the basis for the method was developed in [3], many open issues remain. The first is the error associated with truncating the Neumann series: we need a detailed study of the error incurred by considering only a finite number of powers of the matrix. Moreover, this error depends on the function being computed. We can exploit the sensitivity of different functions to higher powers of the matrix to automatically determine the number of powers to use, optimizing computation time while keeping the error below a given threshold. The other source of error is related to the number of samples N, in our case the number of random walks. In the classic Monte Carlo method this error decreases only as O(1/sqrt(N)), and it is the objective of Task 2 to improve this rate of convergence.
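
The O(1/sqrt(N)) rate is easy to see empirically: quadrupling the number of samples only halves the RMS error. A toy check, on the trivial problem of estimating E[U^2] = 1/3 for U uniform on (0, 1), with illustrative sample sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

def rms_error(n, reps=400):
    """RMS error of the plain Monte Carlo estimate of E[U^2] = 1/3
    over `reps` independent replications of n samples each."""
    estimates = (rng.random((reps, n)) ** 2).mean(axis=1)
    return np.sqrt(np.mean((estimates - 1 / 3) ** 2))

# Each 4x increase in N should roughly halve the error.
errors = {n: rms_error(n) for n in (100, 400, 1600)}
```

This slow decay is precisely why the variance reduction work of Task 2 (multilevel, importance sampling, quasi-random sequences) matters: it attacks the constant, or the rate itself, rather than brute-forcing N.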

At another level, the work of [3] needs to be extended to make more effective use of the available computational units. Splitting the matrix across computing nodes has the adverse effect that random walks may have to migrate between nodes, with potentially high communication overhead. This has been partially addressed in [3], but must be further analyzed, both in terms of buffering, to reduce the number of messages, and in terms of asynchronous execution, to overlap computation and communication as much as possible. New features in OpenMP and MPI will also be explored [5,35].
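
The effect of buffering can be sketched without MPI: the serial toy simulation below partitions the matrix rows across "ranks" and ships migrating walks in batches of FLUSH rather than one message per migration, so the message count grows with the number of batches, not with the number of migrations. Rank count, flush size, and the deadlock-avoiding flush on stall are all illustrative assumptions, not the design proposed here.

```python
from collections import defaultdict, deque

FLUSH = 8  # walks buffered per (source, destination) pair before sending

def run_walks(paths, n_rows, n_ranks):
    """paths: precomputed row sequences, one per walk.
    Returns (migrations, messages) for a block row partition."""
    owner = lambda row: row * n_ranks // n_rows
    outbox = defaultdict(list)            # (src, dst) -> buffered walks
    work = defaultdict(deque)             # rank -> (walk_id, step) queue
    messages = migrations = done = 0
    for wid, p in enumerate(paths):
        work[owner(p[0])].append((wid, 0))
    while done < len(paths):
        progressed = False
        for rank in range(n_ranks):
            while work[rank]:
                wid, step = work[rank].popleft()
                p = paths[wid]
                if step == len(p) - 1:
                    done += 1
                    continue
                dst = owner(p[step + 1])
                if dst == rank:
                    work[rank].append((wid, step + 1))   # stays local
                else:
                    migrations += 1
                    outbox[(rank, dst)].append((wid, step + 1))
                    if len(outbox[(rank, dst)]) >= FLUSH:
                        messages += 1                    # one batched send
                        work[dst].extend(outbox.pop((rank, dst)))
                progressed = True
        if not progressed:                # flush partial buffers on stall
            for (src, dst), buf in list(outbox.items()):
                messages += 1
                work[dst].extend(buf)
                del outbox[(src, dst)]
    return migrations, messages
```

In an actual MPI implementation the batched sends would be nonblocking, so walk advancement on local rows overlaps with the communication of migrating walks.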

Increasingly, high-performance systems rely on heterogeneous computational units, notably for energy efficiency. In this project, we want to explore the most common accelerators available, such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs). GPUs have long been used for Monte Carlo because of the independence between samples. While we can exploit this feature within a node, the fact that samples may have to migrate to another node creates difficulties that we need to solve. As for FPGAs, the research team has a background in hardware design that we can leverage to better exploit this type of accelerator. Furthermore, we need to take advantage of vector processing, making full use of the CPU's SIMD instructions.
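
The data-parallel structure that SIMD units and GPUs exploit can be illustrated with a NumPy sketch that advances a whole batch of walks in lockstep, so the interpreted loop runs over walk steps rather than over individual walks (the function name and inverse-CDF sampling scheme are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def batch_walks(P, start, n_steps, n_walks=10000):
    """Advance n_walks random walks in lockstep over a row-stochastic
    transition matrix P; returns the final state of every walk."""
    cum = np.cumsum(P, axis=1)            # per-row CDF of next states
    states = np.full(n_walks, start)
    for _ in range(n_steps):
        u = rng.random(n_walks)
        # vectorized inverse-CDF sampling: count CDF entries below u
        states = (u[:, None] > cum[states]).sum(axis=1)
    return states
```

The same lockstep layout maps naturally onto GPU threads or CPU SIMD lanes; the open problem noted above is what to do when some lanes' walks must migrate to another node mid-batch.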

Armed with this tool, we plan, in the remaining tasks, to carry out research in three major scientific areas. We firmly believe that the numerical tools to be developed have the potential to unleash significant progress in these fields. These efforts will also demonstrate the merits of the tools and help disseminate them.

The analysis of complex networks, the focus of Task 3, is a topic that has received renewed interest from the scientific community as it cuts across a wide range of areas, with networks emerging, for example, from social interactions, biological phenomena, the World Wide Web, financial systems, and the power grid. One fundamental metric in the study of networks is node centrality, for which we have presented initial results in [3] that we now need to extend; there are several formal definitions of node centrality [23] that we plan to experiment with. Moreover, other network metrics are to be studied: subgraph centrality; the Estrada index; the number of triangles in a network; and total communicability.
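
These metrics are all functions of the adjacency matrix, which is what makes them natural targets for the Task 1 library. As a dense spectral reference (a small SciPy baseline, not the Monte Carlo estimator itself): subgraph centrality of node i is [exp(A)]_ii, the Estrada index is trace(exp(A)), total communicability of node i is the i-th row sum of exp(A), and the triangle count is trace(A^3)/6.

```python
import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a triangle graph (three mutually connected nodes).
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

E = expm(A)
subgraph_centrality = np.diag(E)          # [exp(A)]_ii per node
estrada_index = np.trace(E)               # sum of subgraph centralities
total_communicability = E.sum(axis=1)     # row sums of exp(A)
n_triangles = np.trace(np.linalg.matrix_power(A, 3)) / 6
```

Forming exp(A) densely is only feasible for small graphs; the point of the proposed random-walk method is to estimate individual entries, diagonals, or traces of such matrix functions without ever building them.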

The direct solution for individual state variables of dynamical systems is another outcome made viable by our method, and one we address in Task 4. To assess the efficiency of this procedure, we will analyze the power grids of electronic circuits. For proper operation, power and ground voltages need to reach every device in the circuit at all times and be kept at steady, proper levels, delivering the required energy when requested. The efficient computation of the voltage and/or current at individual circuit elements at any given time instant may represent a very significant contribution, one that can be easily transposed to many other areas. Another dynamical system we plan to explore is in the area of computational chemistry: chemical reactions can be described by a system of differential equations, and the matrix exponential used to determine the concentration of individual molecular species at equilibrium.

In the last few years, fractional differential equations have become a mature and very active research area, mainly because fractional calculus has emerged as an important mathematical tool to accurately describe a variety of natural phenomena characterized by memory effects. The traditional approach, widely explored by the fractional calculus community, consists of classical mesh-based algorithms such as the finite element, finite difference, and spectral methods. Unlike classical differential equations, where the differential operator is local and the underlying discretized problem can therefore be described by a sparse matrix, in fractional differential equations the operator is non-local, and the discretized problem is consequently fully coupled [28]. We believe we have found an effective approach that can surpass existing techniques, and will apply it in Task 5 to the problem of diffusion Magnetic Resonance Imaging (dMRI) [4], as well as to optimizing the learning process in machine learning [37].
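
The non-locality is easy to demonstrate on a small example: the standard 1-D discrete Laplacian is tridiagonal (local), but its fractional power, formed here via an eigendecomposition purely for illustration, is dense, i.e., every unknown couples to every other one. A real solver would never form this matrix explicitly; the problem size and exponent below are arbitrary.

```python
import numpy as np

n, alpha = 12, 0.5
# Standard 1-D discrete Laplacian: tridiagonal, so a *local* operator.
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Fractional power A^alpha via eigendecomposition (A is SPD).
w, V = np.linalg.eigh(A)
A_frac = V @ np.diag(w ** alpha) @ V.T

local_nnz = np.count_nonzero(np.abs(A) > 1e-10)       # 3n - 2 entries
frac_nnz = np.count_nonzero(np.abs(A_frac) > 1e-10)   # dense coupling
```

This fill-in is what makes mesh-based solvers expensive for fractional problems, and why a matrix-function approach that never assembles the dense operator is attractive.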