Using different formulations of plotting positions
==================================================

Computing plotting positions
----------------------------

When drawing a percentile, quantile, or probability plot, the plotting
positions of ordered data must be computed.

For a sample :math:`X` of size :math:`n`, the plotting position of the
:math:`j^\mathrm{th}` element of the sorted sample is defined as:

.. math::  \frac{j - \alpha}{n + 1 - \alpha - \beta}

In this equation, α and β can take on several values. Common values are
described below:

    "type 4" (α=0, β=1)
        Linear interpolation of the empirical CDF.
    "type 5" or "hazen" (α=0.5, β=0.5)
        Piecewise linear interpolation.
    "type 6" or "weibull" (α=0, β=0)
        Weibull plotting positions. Unbiased exceedance probability for all distributions.
        Recommended for hydrologic applications.
    "type 7" (α=1, β=1)
        The default values in R.
        Not recommended with probability scales as the min and max data points get plotting positions of 0 and 1, respectively, and therefore cannot be shown.
    "type 8" (α=1/3, β=1/3)
        Approximately median-unbiased.
    "type 9" or "blom" (α=0.375, β=0.375)
        Approximately unbiased positions if the data are normally distributed.
    "median" (α=0.3175, β=0.3175)
        Median exceedance probabilities for all distributions (used in ``scipy.stats.probplot``).
    "apl" or "pwm" (α=0.35, β=0.35)
        Used with probability-weighted moments.
    "cunnane" (α=0.4, β=0.4)
        Nearly unbiased quantiles for normally distributed data.
        This is the default value.
    "gringorten" (α=0.44, β=0.44)
        Used for Gumbel distributions.
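
The formula is short enough to sketch directly with ``numpy``, taking
:math:`j` as the 1-based rank of each sorted value. The helper
``plotting_positions`` below is a hypothetical name used only for
illustration and is not part of ``probscale``; only a handful of the
(α, β) pairs listed above are included.

.. code:: python

    import numpy


    def plotting_positions(n, alpha, beta):
        """ Illustrative helper (not a probscale function):
        (j - alpha) / (n + 1 - alpha - beta) for ranks j = 1..n """
        j = numpy.arange(1, n + 1)
        return (j - alpha) / (n + 1 - alpha - beta)


    # a few (alpha, beta) pairs from the list above
    params = {
        'weibull': (0.0, 0.0),
        'cunnane': (0.4, 0.4),
        'hazen': (0.5, 0.5),
        'type 7': (1.0, 1.0),
    }

    n = 5
    for name, (alpha, beta) in params.items():
        print(name, numpy.round(plotting_positions(n, alpha, beta), 3))

With α=1 and β=1 the first and last positions come out as exactly 0 and
1, which is why "type 7" cannot be drawn on a probability scale.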

The purpose of this tutorial is to show how the selected α and β can
alter the shape of a probability plot.

First let's get some analytical setup out of the way...

.. code:: python

    %matplotlib inline

.. code:: python

    import warnings
    warnings.simplefilter('ignore')
    
    import numpy
    from matplotlib import pyplot
    from scipy import stats
    import seaborn
    
    clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
    seaborn.set(style='ticks', context='talk', color_codes=True, rc=clear_bkgd)
    
    import probscale
    
    
    def format_axes(ax1, ax2):
        """ Sets axes labels and grids """
        for ax in (ax1, ax2):
            if ax is not None:
                ax.set_ylim(bottom=1, top=99)
                ax.set_xlabel('Values of Data')
                seaborn.despine(ax=ax)
                ax.yaxis.grid(True)
            
        ax1.legend(loc='upper left', numpoints=1, frameon=False)
        ax1.set_ylabel('Normal Probability Scale')
        if ax2 is not None:
            ax2.set_ylabel('Weibull Probability Scale')

Normal vs Weibull scales and Cunnane vs Weibull plotting positions
------------------------------------------------------------------

Here we'll generate some fake, normally distributed data and define a
Weibull distribution from scipy to use for a probability scale.

.. code:: python

    numpy.random.seed(0)  # reproducible
    data = numpy.random.normal(loc=5, scale=1.25, size=37)
    
    # simple weibull distribution
    weibull = stats.weibull_min(2)

Now let's create probability plots on both Weibull and normal
probability scales. Additionally, we'll compute the plotting positions
in two different but common ways for each plot.

First, in blue circles, we'll show the data with Weibull (α=0, β=0)
plotting positions. Weibull plotting positions are commonly used in
fields such as hydrology and water resources engineering.

In green squares, we'll use Cunnane (α=0.4, β=0.4) plotting positions.
Cunnane plotting positions are good for normally distributed data and
are the default values.

.. code:: python

    w_opts = {'label': 'Weibull (α=0, β=0)',     'marker': 'o', 'markeredgecolor': 'b'}
    c_opts = {'label': 'Cunnane (α=0.4, β=0.4)', 'marker': 's', 'markeredgecolor': 'g'}
    
    common_opts = {
        'markerfacecolor': 'none',
        'markeredgewidth': 1.25,
        'linestyle': 'none'
    }
    
    fig, (ax1, ax2) = pyplot.subplots(figsize=(10, 8), ncols=2, sharex=True, sharey=False)
    
    for dist, ax in zip([None, weibull], [ax1, ax2]):
        for opts, postype in zip([w_opts, c_opts,], ['weibull', 'cunnane']):
            probscale.probplot(data, ax=ax, dist=dist, probax='y', 
                               scatter_kws={**opts, **common_opts}, 
                               pp_kws={'postype': postype})
    
    format_axes(ax1, ax2)
    fig.tight_layout()


.. image:: closer_look_at_plot_pos_files/output_9_0.png


This demonstrates that the different formulations of the plotting
positions vary most at the extreme values of the dataset.
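
One way to see why, taking :math:`j` as the rank of a value in the
sorted sample: the gap between the Weibull position (α=0, β=0) and any
symmetric formulation with α=β works out to

.. math::

    \frac{j}{n + 1} - \frac{j - \alpha}{n + 1 - 2\alpha}
    = \frac{\alpha \, (n + 1 - 2j)}{(n + 1)(n + 1 - 2\alpha)}

The numerator is zero at the median rank :math:`j = (n + 1) / 2` and
grows linearly in magnitude toward :math:`j = 1` and :math:`j = n`, so
the formulations agree in the middle of the data and diverge at both
tails.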

Hazen plotting positions
~~~~~~~~~~~~~~~~~~~~~~~~

Next, let's compare the Hazen/Type 5 (α=0.5, β=0.5) formulation to
Cunnane. Hazen plotting positions (shown as red triangles) represent a
piecewise linear interpolation of the empirical cumulative distribution
function of the dataset.

Given that the Hazen values of α=0.5 and β=0.5 differ only slightly
from the Cunnane values, the plotting positions are predictably similar.

.. code:: python

    h_opts = {'label': 'Hazen (α=0.5, β=0.5)', 'marker': '^', 'markeredgecolor': 'r'}
    fig, (ax1, ax2) = pyplot.subplots(figsize=(10, 8), ncols=2, sharex=True, sharey=False)
    
    for dist, ax in zip([None, weibull], [ax1, ax2]):
        for opts, postype in zip([c_opts, h_opts,], ['cunnane', 'hazen']):
            probscale.probplot(data, ax=ax, dist=dist, probax='y', 
                               scatter_kws={**opts, **common_opts}, 
                               pp_kws={'postype': postype})
    
    format_axes(ax1, ax2)
    fig.tight_layout()


.. image:: closer_look_at_plot_pos_files/output_11_0.png


Summary
~~~~~~~

At the risk of showing a very cluttered and hard to read figure, let's
throw all three on the same normal probability scale:

.. code:: python

    fig, ax1 = pyplot.subplots(figsize=(6, 8))
    
    for opts, postype in zip([w_opts, c_opts, h_opts,], ['weibull', 'cunnane', 'hazen']):
        probscale.probplot(data, ax=ax1, dist=None, probax='y', 
                           scatter_kws={**opts, **common_opts}, 
                           pp_kws={'postype': postype})
            
    format_axes(ax1, None)
    fig.tight_layout()


.. image:: closer_look_at_plot_pos_files/output_13_0.png


Again, the different values of α and β don't significantly alter the
shape of the probability plot between, say, the lower and upper
quartiles. Beyond the quartiles, however, the difference is more
obvious.
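
To put rough numbers on that, the sketch below maps each plotting
position onto a normal probability axis with ``scipy.stats.norm.ppf``
and measures how far apart the Weibull and Hazen markers sit, in
standard deviations. It assumes the same sample size as above (37) and
recomputes the positions from the formula rather than calling
``probscale``; the three ranks chosen for comparison are illustrative.

.. code:: python

    import numpy
    from scipy import stats

    n = 37
    j = numpy.arange(1, n + 1)

    # (alpha, beta) pairs for the three formulations compared above
    params = {'weibull': (0.0, 0.0), 'cunnane': (0.4, 0.4), 'hazen': (0.5, 0.5)}

    # location of each rank on a normal probability axis (standard deviations)
    z = {
        name: stats.norm.ppf((j - a) / (n + 1 - a - b))
        for name, (a, b) in params.items()
    }

    # vertical distance between the Weibull and Hazen markers at each rank
    spread = numpy.abs(z['weibull'] - z['hazen'])
    for label, idx in [('smallest value', 0), ('near lower quartile', 9), ('median', 18)]:
        print(label, numpy.round(spread[idx], 3))

The distance shrinks steadily from the smallest value toward the median,
where the two formulations coincide.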

The cell below computes the plotting positions with the three sets of α
and β values that we've investigated and prints the first ten values for
easy comparison.

.. code:: python

    # weibull plotting positions and sorted data
    w_probs, _ = probscale.plot_pos(data, postype='weibull')
    
    # Cunnane plotting positions; the returned sorted data are identical to above
    c_probs, _ = probscale.plot_pos(data, postype='cunnane')
    
    # Hazen (type 5) plotting positions
    h_probs, _ = probscale.plot_pos(data, postype='hazen')
    
    # convert to percentages
    w_probs *= 100
    c_probs *= 100
    h_probs *= 100
    
    print('Weibull: ', numpy.round(w_probs[:10], 2))
    print('Cunnane: ', numpy.round(c_probs[:10], 2))
    print('Hazen:   ', numpy.round(h_probs[:10], 2))


.. parsed-literal::

    Weibull:  [  2.63   5.26   7.89  10.53  13.16  15.79  18.42  21.05  23.68  26.32]
    Cunnane:  [  1.61   4.3    6.99   9.68  12.37  15.05  17.74  20.43  23.12  25.81]
    Hazen:    [  1.35   4.05   6.76   9.46  12.16  14.86  17.57  20.27  22.97  25.68]
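
As a quick check, the first value in each row follows from the plotting
position formula with rank :math:`j = 1` and :math:`n = 37`, the sample
size used above (Weibull, Cunnane, and Hazen, in that order):

.. math::

    \frac{1 - 0}{37 + 1 - 0 - 0} = \frac{1}{38} \approx 2.63\%, \qquad
    \frac{1 - 0.4}{37 + 1 - 0.4 - 0.4} = \frac{0.6}{37.2} \approx 1.61\%, \qquad
    \frac{1 - 0.5}{37 + 1 - 0.5 - 0.5} = \frac{0.5}{37} \approx 1.35\%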