A Quick LAPACK Example
Posted on 2017-05-19
With the release of Simply Fortran 2.37, LAPACK and BLAS routines are now easily available on all supported platforms. LAPACK, or Linear Algebra Package, provides an enormous collection of subroutines for dealing with many types of matrix operations and linear systems , while BLAS, or Basic Linear Algebra Subprograms, provides lower level vector and matrix math operations. Both libraries are extremely common within scientific computing, and their inclusion with Simply Fortran should improve productivity for many of our users.
This quick example will look at solving an overdetermined system via a least-squares regression. LAPACK provides a number of routines supporting least squares operation, and this example uses SGELS, a single-precision generalized least squares solver. First, though, we'll need some sample data...
For those familiar with the sport of soccer (or football outside of the United States of America), one might expect that a correlation exists between the number of shots taken and the number of goals scored by any given player. It would be impossible to score, in fact, if no shots are taken. With the proper statistics, we can actually estimate the the percentage of goals resulting from a single player's total number of shots.
The National Womens Soccer League, or NWSL, is a professional soccer league within the United States, and it started reliably tracking player statistics during its 2016 season. The data is readily available here. Amongst the plethora of data are two sets we would need, shots taken and goals scored by each player. With some minor spreadsheet manipulation, we can build a single text file that contains the number of shots taken by each player and the corresponding number of goals scored by said player over the course of the NWSL 2016 season. We can eliminate players who did not take any shots since they don't really conform to our analysis. The simple data file can be downloaded here: nwsl2016.txt
We first need a simple subroutine to load the data into two arrays of type REAL for LAPACK to analyze. Our routine will also assign a variable with the number of data points to allow dimensioning some additional arrays before our analysis begins. The variables n, totalshots, and goals are defined in the main program, but are within the scope of this subroutine:
! Simply loads the NWSL data from a text file subroutine loaddata() implicit none integer::row_shots, row_goals integer::i open(unit=100, file='nwsl2016.txt', status='old') ! Two header rows read(100, *) read(100, *) ! First number is the number of data points read(100, *) n ! Allocate data arrays allocate(totalshots(n), goals(n)) ! Load each data point as integers and re-store ! them in our arrays as REAL values do i = 1, n read(100, *) row_shots, row_goals totalshots(i) = real(row_shots) goals(i) = real(row_goals) end do close(100) end subroutine loaddata
The loading routine is not particularly interesting, so we can continue to our actual least squares analysis with the data properly loaded.
To calculate our least squares fit, we need to configure two major arrays. The overdetermined system we're attempting to solve is defined simply as:
A*x = b
Our A matrix will contain the total shots taken data. Since we also want to calculate a y-intercept in this linear regression, though, our A matrix should have two columns and n rows. The first column of A, in order to properly sove for the intercept, should be entirely set to 1.0. We also need a b matrix containing our goals scored data. This matrix will only have a single column, but we need to create a new array rather than using our original data array directly. The SGELS subroutine will modify the data in both A and b in the course of its operations, so it's often desirable to create arrays just for this task, leaving the original data untouched. Our program is now:
program shots use aplot implicit none integer::n real, dimension(:), allocatable::totalshots, goals ! Arrays for calling LAPACK's SGELS driver real, dimension(:,:), allocatable::bdata, adata real, dimension(:), allocatable::work integer::info ! First, load all our data call loaddata() ! The 'n' variable now holds the number of data points, ! and our totalshots and goals arrays should be properly ! dimensioned and populated. ! Build arrays for LAPACK calls allocate(adata(n,2), bdata(n,1), work(2*n)) adata(:,1) = 1.0 ! So an intercept is calculated adata(:,2) = totalshots ! Total shots bdata(:,1) = goals ! Goals
The work array is simply necessary to call SGELS.
The call to SGELS itself can appear daunting, but we've organized our inputs for this simple problem relatively well. The SGELS call becomes:
! Perform a least squares fit - documentation at: call SGELS('N', & ! TRANS n, & ! M 2, & ! N 1, & ! NRHS adata, & ! A n, & ! LDA bdata, & ! B n, & ! LDB work, & ! WORK 2*n, & ! LWORK info)
The first input, 'N', specifies that the arrays are normal as opposed to transposed. The remaining scalar inputs basically deal with the dimensions of our arrays. While we might be able to explain each, the LAPACK documentation does a far superior job.
If all goes according to plan, the value after calling SGELS should be zero. We can add some print statements to generate some output as well:
Print *, "Result (0=success): ", info Print *, "Intercept: ", bdata(1,1) Print *, "Slope: ", bdata(2,1)
Before we try to compile this program, we need to enable LAPACK and BLAS linking. In the Project Options window under the "Linker" tab, there is a new option to enable LAPACK and BLAS linking:
After enabling this option and saving the project, this program should run just fine, producing the following output:
Result (0=success): 0 Intercept: -0.338190973 Slope: 0.126037121
Based on this data, about 13% of shots taken result in a goal. The negative intercept reflects the fact that, just because a shot was taken, a goal isn't garaunteed. Multiple players have numerous shots without ever having scored.
We haven't really considered the "quality" of this linear regression by computing an R^2 value or computing t-scores for our two coefficients. Since this task is merely a demonstration, though, we might be more interested in plotting the resulting linear regression versus actual data. Luckily, Simply Fortran includes the Aplot plotting library for just such a reason. While we already have our original data, we'll need to compute regression data from our results:
type(aplot_t)::p integer::i real, dimension(:), allocatable::fitdata ... ! Generate regression data for each totalshots point allocate(fitdata(n)) do i = 1, n fitdata(i) = bdata(1,1) + totalshots(i)*bdata(2,1) end do
Plotting both on the same plot is straightforward:
! Plot the results p = initialize_plot() call add_dataset(p, totalshots, goals) call set_seriestype(p, 0, APLOT_STYLE_DOT) call set_serieslabel(p, 0, "Player Data") call add_dataset(p, totalshots, fitdata) call set_seriestype(p, 1, APLOT_STYLE_LINE) call set_serieslabel(p, 1, "Linear Fit") call set_xlabel(p, "Total Shots") call set_ylabel(p, "Goals Scored") call set_title(p, "NWSL 2016: Total Shots vs. Goals Scored") call display_plot(p) call destroy_plot(p)
After adding this to our program, we can now visualize our little analysis:
This little tutorial is meant to demonstrate the presence of LAPACK within Simply Fortran's latest version. Everything needed to replicate this is linked below: