NumRu::GAnalysis::Fitting

Library for function fitting

Constants

X: predefined Proc for fitting: Polynomial x (function of the 1st dim)
XX: predefined Proc for fitting: Polynomial x**2 (function of the 1st dim)
XY: predefined Proc for fitting: Polynomial x*y (function of the 1st&2nd dims)
Y: predefined Proc for fitting: Polynomial y (function of the 2nd dim)
YY: predefined Proc for fitting: Polynomial y**2 (function of the 2nd dim)

Public Instance Methods

least_square_fit(data, grid_locs, functions, ensemble_dims=nil, indep_dims=nil) click to toggle source

Least square fit of a linear combination of any functions (basic NArray version).

ARGUMENTS

data [NArray or NArrayMiss] multi-D data to fit
grid_locs [Array of 1D NArrays] Grid points of independent variables (so grid_locs.length == the # of independent variables).
functions [Array of Procs] Proc objects to represent the functions, which accept the elements of grid_locs as the arguments (so the number of arguments fed is equal to the length of grid_locs).
ensemble_dims (optional) [nil (defualt) or Array of Integers] When grid_locs.length < data.rank, this argument can be used to specify the dimensions that are not included in grid_locs and are used for ensemble averaging
indep_dims (optional) [nil (defualt) or Array of Integers] When grid_locs.length < data.rank, this argument can be used to specify the dimensions that are not included in grid_locs and are treated as independent, so the fitting is made for each of their component.

Note that the sum of the lengths of grid_locs, ensemble_dims and indep_dims must be equal to the rank (# of dims) of data.

RETURN VALUES

[ c, bf, diff ]

where

c is a NArray containing the coefficients of the functions and the constant offset; its length is one greater than the number of functions because of the offset. It is 1D unless the indep_dims argument is used (see the examples below).
bf is a NArray having the best fit grid point values. Its rank is equal to data.rank, but the lengths along ensemble_dims are simply 1.
rms of the difference between the data and best fit

EXAMPLES

Simple 1D case

Line fitting:
```
nx = 5
x = NArray.float(nx).indgen! - nx/2
data = x + x*x*0.1
c, bf = GAnalysis::Fitting.least_square_fit(data, [x], 
                                          [GAnalysis::Fitting::X])
p "data:", data, "c:", c, "bf:", bf
```
Here, +GAnalysis::Fitting::X+ is a predefined Proc to represent the first order polynomial x. The data values given as above follow f(x) = x + x**2/10. Then the result printed by the last line is
```
"data:"
NArray.float(5): 
[ -1.6, -0.9, 0.0, 1.1, 2.4 ]
"c:"
NArray.float(2): 
[ 1.0, 0.2 ]
"bf:"
NArray.float(5): 
[ -1.8, -0.8, 0.2, 1.2, 2.2 ]
```
The c values indicate that the fitting result is f(x) = 1.0*x + 0.2, and the bf values are its grid point values.

Parabolic fitting:

You can also fit the data by 2nd order polynomial as
```
c, bf = GAnalysis::Fitting.least_square_fit(data, [x], 
                  [GAnalysis::Fitting::XX,GAnalysis::Fitting::X])
```
Then the result will be
```
p c   #--> [0.1, 1.0, 0.0]
```
which indicates the original 2nd order polynomial 0.1 x**2 + x, so it follows data == bf (except for round-off error if any).
1D fitting of multi-D data (ensemble case)

Suppose you have a 2D NArray (or NArrayMiss) data, in which the 1st dim represents x and the 2nd dim represents something else (such as time sequence, or just a simple ensemble). If you want to use the entire data to get a single fit, use the ensemble_dims argument to specify the non-x dimension(s). You can fit the data, for example, by p*sin(x) + q*cos(x) + r as follows:
```
sin = proc{|x| NMath.sin(x)}
cos = proc{|x| NMath.cos(x)}
c, bf = GAnalysis::Fitting.least_square_fit(data, [x], 
     [sin, cos], [1])
```
Here, the last parameter [1] is given as the arguemnt ensemble_dims to express that the dimension 1 (2nd dimension) of data is the ensemble dimension, so the x coordinate is the remaining dimension 0 (1st dimension). The coefficients of the functions are returned by the 1st return value as a NArray, so
```
p = c[0]
q = c[1]
r = c[2]
```
1D fitting of multi-D data (individual fitting)

Suppose you have the same data as above, but you want to fit it for each of the 2nd dim elements. You can do it as follows:
```
c, bf = GAnalysis::Fitting.least_square_fit(data, [x], 
     [sin, cos], nil, [1])
```
Here, nil is given as the 4th argument (ensemble_dims) and [1] is given as the fifth (indep_dims). In this case, the return value c is 2-dimensional; the first being the coefficients as above and the second representing the non-x (i.e., the second) dim of data.
2D fitting

It can be done like
```
cosx = proc {|x,y| NMath.cos(x).newdim!(-1)}
sinx = proc {|x,y| NMath.sin(x).newdim!(-1)}
cosy = proc {|x,y| NMath.cos(y).newdim!(0)}
siny = proc {|x,y| NMath.sin(y).newdim!(0)}
c, bf = GAnalysis::Fitting.least_square_fit(data4D, [x,y], 
            [cosx, sinx, cosy, siny], [2,3])
```
where data4D is a 4D NArray, whose first and second dimensions (dimensions 0 and 1) represent x and y axis, respectively, and the 1D NArrays x and y are the grid points. Note that the functions (cosx etc) accept 2 arguments (x and y), and they use NArray’s newdim method to return 2D NArray (newdim!(-1) inserts a 1-element dim to the end, and newdim(0) inserts a 1-element dim to the beginning).

TYPICAL ERRORS

Error is raised (from the LU decomposition), if the problem cannot be solved. That happens if you specify a same function twice (redundantly) in the functions argument, as a matter of course.
Error is raised if the number of data is insuffcient for the number of functions (also unsolvable).

# File ../../lib/numru/ganalysis/fitting.rb, line 217
def least_square_fit(data, grid_locs, functions, ensemble_dims=nil,
                     indep_dims=nil)

  #< argument check >

  grid_locs.each_with_index{|x,i| self.ensure_1D_NArray(x, i)}
  functions.each{|f| raise("Found non-Proc arg") if !f.is_a?(Proc)}

  functions = functions + [@@unity]   # constanf offset

  ng = grid_locs.length
  rank = data.rank
  ensemble_dims = [ ensemble_dims ] if ensemble_dims.is_a?(Integer)
  indep_dims = [ indep_dims ] if indep_dims.is_a?(Integer)

  ensemble_dims = Array.new if ensemble_dims.nil?   # --> always an Array
  n_indep = ( indep_dims ? indep_dims.length : 0 )

  if ng < rank 
    ensemble_dims = ensemble_dims.collect{|d| 
      if d<-rank || d>=rank
        raise "Invalid ensemble_dims value (#{d}) for rank #{rank} NArray"
      end
      d += rank if d<0
      d
    }
    ensemble_dims.sort!
    if indep_dims
      indep_dims = indep_dims.collect{|d|
        if d<-rank || d>=rank
          raise "Invalid indep_dims value (#{d}) for rank #{rank} NArray"
        end
        d += rank if d<0
        d
      }
      indep_dims.sort!
    end
  elsif ng > rank
    raise "# of grid_locs (#{ng}) > data.rank (#{rank})"
  end

  if data.rank != ng + ensemble_dims.length + n_indep
      raise ArgumentError,
         "lengths of grid_locs, ensemble_dims and indep_dims != data.rank"
  end

  otherdims = ensemble_dims
  if indep_dims
    otherdims += indep_dims
    otherdims.sort!.uniq!
    if otherdims.length != ensemble_dims.length + n_indep
      raise ArgumentError, "Overlap in ensemble_dims and indep_dims"
    end
  end

  #< pre-process data >

  d0 = data.mean
  data = data - d0     # constant offset for numerical stability

  if data.is_a?(NArrayMiss)
    mask = data.get_mask
  elsif data.is_a?(NArray)
    mask = nil  # NArray.byte(*data.shape).fill!(1)
  else
    raise "Data type (#{data.class}) is not NArray or NArrayMiss"
  end

  #< derive the matrix >

  fv = functions.collect{|f| 
    f = f[*grid_locs]
    otherdims.each{|d| f.newdim!(d)}
    f
  }

  ms = fv.length    # matrix size

  if ( (len=data.length) < ms )
    raise "Insufficient data length (#{len}) for the # of funcs+1 (#{ms})"
  end

  mat = NMatrix.float(ms,ms)   #  wil be symmetric

  for i in 0...ms
    for j in 0..i
      if mask
        fvij = NArrayMiss.to_nam( fv[i] * fv[j] * mask, mask )
        mat[i,j] = (fvij).mean
      else
        mat[i,j] = (fv[i] * fv[j]).mean
      end
    end
  end

  for i in 0...ms
    for j in i+1...ms
      mat[i,j] = mat[j,i]      # symmetric
    end
  end
  #p "*** mat ***",mat
  lu = mat.lu

  #< derive the vector, solve, and best fit >

  unless indep_dims    # fitting only once
    # derive the vector
    b = NVector.float(ms)
    for i in 0...ms
      b[i] = (data * fv[i]).mean
    end

    # solve
    c = lu.solve(b)
    c[-1] += d0      # add the mean subtracted

    # convert c from NVector to NArray (just for cleanliness)
    na = NArray.float(ms)
    na[true] = c[true]
    c = na

    # best fit
    bf = c[-1]    # the constant offset
    for i in 0...ms-1
      bf += c[i]*fv[i]
    end

  else    # fitting multiple times

    # derive vectors
    idshp = indep_dims.collect{|d| data.shape[d]}
    bs = NArray.float(ms,*idshp)
    meandims = (0...rank).collect{|d| d} - indep_dims
    for i in 0...ms
      bsi = (data * fv[i]).mean(*meandims)
      if bsi.is_a?(NArrayMiss)
        if bsi.count_invalid > 0
          raise("Found invalid data everywhere along indep_dims. Trim data in advance and try again.")
        end
        bsi = bsi.to_na
      end
      bs[i,false] = bsi
    end
    idlen = 1
    idshp.each{|l| idlen *= l}

    # solve
    bs = bs.reshape(ms, idlen)
    c = NArray.float(ms,idlen)
    b = NVector.float(ms)
    for id in 0...idlen
      b[true] = bs[true,id]
      c[true,id] = lu.solve(b)
    end
    c[-1,true] += d0
    c = c.reshape(ms, *idshp)

    # best fit
    idshp_full = Array.new
    for d in 0...rank
      if indep_dims.include?(d)
        idshp_full[d] = data.shape[d]
      else
        idshp_full[d] = 1
      end
    end
    cs = c.reshape(ms, *idshp_full)
    bf = cs[-1,false]
    for i in 0...ms-1
      bf += cs[i,false]*fv[i]
    end

  end

  diff = Math.sqrt( ( (data + d0 - bf)**2 ).mean )

  #< return >

  [ c, bf, diff ]
end

Home Classes Methods

In Files