A closer look at probability plots
==================================

Overview
--------

The ``probscale.probplot`` function let's you do a couple of things.
They are:

1. Creating percentile, quantile, or probability plots.
2. Placing your probability scale either axis.
3. Specifying an arbitrary distribution for your probability scale.
4. Drawing a best-fit line line in linear-probability or log-probability
   space.
5. Computing the plotting positions of your data anyway you want.
6. Using probability axes on seaborn ``FacetGrids``

We'll go over all of these options in this tutorial.

.. code:: python

    %matplotlib inline

.. code:: python

    import warnings
    warnings.simplefilter('ignore')
    
    import numpy
    from matplotlib import pyplot
    import seaborn
    
    import probscale
    clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
    seaborn.set(style='ticks', context='talk', color_codes=True, rc=clear_bkgd)
    
    # load up some example data from the seaborn package
    tips = seaborn.load_dataset("tips")
    iris = seaborn.load_dataset("iris")

Different plot types
--------------------

In general, there are three plot types:

1. Percentile, a.k.a. P-P plots
2. Quantile, a.k.a. Q-Q plots
3. Probability, a.k.a. Prob Plots

Percentile plots
~~~~~~~~~~~~~~~~

Percentile plots are the simplest plots. You simply plot the data
against their plotting positions. The plotting positions are shown on a
linear scale, but the data can be scaled as appropriate.

If you were doing that from scratch, it would look like this:

.. code:: python

    position, bill = probscale.plot_pos(tips['total_bill'])
    position *= 100
    fig, ax = pyplot.subplots(figsize=(6, 3))
    ax.plot(position, bill, marker='.', linestyle='none', label='Bill amount')
    ax.set_xlabel('Percentile')
    ax.set_ylabel('Total Bill (USD)')
    ax.set_yscale('log')
    ax.set_ylim(bottom=1, top=100)
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_4_0.png


Using the ``probplot`` function with ``plottype='pp'``, it becomes:

.. code:: python

    fig, ax = pyplot.subplots(figsize=(6, 3))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='pp', datascale='log',
                             problabel='Percentile', datalabel='Total Bill (USD)',
                             scatter_kws=dict(marker='.', linestyle='none', label='Bill Amount'))
    ax.set_ylim(bottom=1, top=100)
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_6_0.png


Quantile plots
~~~~~~~~~~~~~~

Quantile plots are similar to propbabilty plots. The main differences is
that plotting positions are converted into quantiles or :math:`Z`-scores
based on a probability distribution. The default distribution is the
standard-normal distribution. Using a different distribution is covered
further down.

Usings the same dataset as a above let's make a quantile plot. Like
above, we'll do it from scratch and then using ``probplot``.

.. code:: python

    from scipy import stats
    
    position, bill = probscale.plot_pos(tips['total_bill'])
    quantile = stats.norm.ppf(position)
    
    fig, ax = pyplot.subplots(figsize=(6, 3))
    ax.plot(quantile, bill, marker='.', linestyle='none', label='Bill amount')
    ax.set_xlabel('Normal Quantiles')
    ax.set_ylabel('Total Bill (USD)')
    ax.set_yscale('log')
    ax.set_ylim(bottom=1, top=100)
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_8_0.png


Using ``probplot``:

.. code:: python

    fig, ax = pyplot.subplots(figsize=(6, 3))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='qq', datascale='log', 
                             problabel='Standard Normal Quantiles', datalabel='Total Bill (USD)',
                             scatter_kws=dict(marker='.', linestyle='none', label='Bill Amount'))
    
    ax.set_ylim(bottom=1, top=100)
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_10_0.png


You'll notice that the shape of the data is straighter on the Q-Q plot
than the P-P plot. This is due to the transformation that takes place
when converting the plotting positions to a distribution's quantiles.
The plot below hopefully illustrates this more clearly. Additionally,
we'll show how use the ``probax`` option to flip the plot so that the
P-P/Q-Q/Probability axis is on the y-scale.

.. code:: python

    fig, (ax1, ax2) = pyplot.subplots(figsize=(6, 6), ncols=2, sharex=True)
    markers = dict(marker='.', linestyle='none', label='Bill Amount')
    
    fig = probscale.probplot(tips['total_bill'], ax=ax1, plottype='pp', probax='y',
                             datascale='log', problabel='Percentiles',
                             datalabel='Total Bill (USD)', scatter_kws=markers)
    
    fig = probscale.probplot(tips['total_bill'], ax=ax2, plottype='qq', probax='y',
                             datascale='log', problabel='Standard Normal Quantiles',
                             datalabel='Total Bill (USD)', scatter_kws=markers)
    
    ax1.set_xlim(left=1, right=100)
    fig.tight_layout()
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_12_0.png


In these case of P-P plots and simple Q-Q plots, the ``probplot``
function doesn't offer much convencience compared to writing raw
matplotlib commands. However, this changes when you start making
probability plots and using more advanced options.

Probability plots
~~~~~~~~~~~~~~~~~

Visually, the curve of plots on probability and quantile scales should
be the same. The difference is that the axis ticks are placed and
labeled based on non-exceedance probailities rather than the more
abstract quantiles of the distribution.

Unsurprisingly, a picture explains this much better. Let's build off of
the previos plot:

.. code:: python

    fig, (ax1, ax2, ax3) = pyplot.subplots(figsize=(9, 6), ncols=3, sharex=True)
    common_opts = dict(
        probax='y', 
        datascale='log', 
        datalabel='Total Bill (USD)',
        scatter_kws=dict(marker='.', linestyle='none')
    )
    
    fig = probscale.probplot(tips['total_bill'], ax=ax1, plottype='pp',
                             problabel='Percentiles',  **common_opts)
    
    fig = probscale.probplot(tips['total_bill'], ax=ax2, plottype='qq',
                             problabel='Standard Normal Quantiles',  **common_opts)
    
    fig = probscale.probplot(tips['total_bill'], ax=ax3, plottype='prob',
                             problabel='Standard Normal Probabilities',  **common_opts)
    
    ax3.set_xlim(left=1, right=100)
    ax3.set_ylim(bottom=0.13, top=99.87)
    fig.tight_layout()
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_14_0.png


Visually, shapes of the curves on the right-most plots are identical.
The difference is that the y-axis ticks and labels are more "human"
readable.

In other words, the probability (right) axis gives us the ease of
finding e.g. the 75th percentile found on percentile (left) axis, and
illustrates how well the data fit a given distribution like the quantile
(middle) axes.

Using different distributions for your scales
---------------------------------------------

When using quantile or probability scales, you can pass a distribution
from the ``scipy.stats`` module to the ``probplot`` function. When a
distribution is not provided to the ``dist`` parameter, a standard
normal distribution is used.

.. code:: python

    common_opts = dict(
        plottype='prob',
        probax='y',
        datascale='log',
        datalabel='Total Bill (USD)',
        scatter_kws=dict(marker='+', linestyle='none', mew=1)
    )
    
    alpha = stats.alpha(10)
    beta = stats.beta(6, 3)
    
    fig, (ax1, ax2, ax3) = pyplot.subplots(figsize=(9, 6), ncols=3, sharex=True)
    fig = probscale.probplot(tips['total_bill'], ax=ax1, dist=alpha, 
                             problabel='Alpha(10) Probabilities', **common_opts)
    
    fig = probscale.probplot(tips['total_bill'], ax=ax2, dist=beta, 
                             problabel='Beta(6, 1) Probabilities', **common_opts)
    
    fig = probscale.probplot(tips['total_bill'], ax=ax3, dist=None,
                             problabel='Standard Normal Probabilities', **common_opts)
    
    ax3.set_xlim(left=1, right=100)
    for ax in [ax1, ax2, ax3]:
        ax.set_ylim(bottom=0.2, top=99.8)
    seaborn.despine()
    fig.tight_layout()


.. image:: closer_look_at_viz_files/output_16_0.png


This can also be done for QQ scales:

.. code:: python

    common_opts = dict(
        plottype='qq',
        probax='y',
        datascale='log',
        datalabel='Total Bill (USD)',
        scatter_kws=dict(marker='+', linestyle='none', mew=1)
    )
    
    alpha = stats.alpha(10)
    beta = stats.beta(6, 3)
    
    fig, (ax1, ax2, ax3) = pyplot.subplots(figsize=(9, 6), ncols=3, sharex=True)
    fig = probscale.probplot(tips['total_bill'], ax=ax1, dist=alpha, 
                             problabel='Alpha(10) Quantiles', **common_opts)
    
    fig = probscale.probplot(tips['total_bill'], ax=ax2, dist=beta, 
                             problabel='Beta(6, 3) Quantiles', **common_opts)
    
    fig = probscale.probplot(tips['total_bill'], ax=ax3, dist=None,
                             problabel='Standard Normal Quantiles', **common_opts)
    
    ax1.set_xlim(left=1, right=100)
    seaborn.despine()
    fig.tight_layout()


.. image:: closer_look_at_viz_files/output_18_0.png


Using a specific distribution with a quantile scale can give us an idea
of how well the data fit that distribution. For instance, let's say we
have a hunch that the values of the ``total_bill`` column in our dataset
are normally distributed and their mean and standard deviation are 19.8
and 8.9, respectively. We could investigate that by create a
``scipy.stat.norm`` distribution with those parameters and use that
distribution in the Q-Q plot.

.. code:: python

    def equality_line(ax, label=None):
        limits = [
            numpy.min([ax.get_xlim(), ax.get_ylim()]),
            numpy.max([ax.get_xlim(), ax.get_ylim()]),
        ]
        ax.set_xlim(limits)
        ax.set_ylim(limits)
        ax.plot(limits, limits, 'k-', alpha=0.75, zorder=0, label=label)
    
    norm = stats.norm(loc=21, scale=8)
    fig, ax = pyplot.subplots(figsize=(5, 5))
    ax.set_aspect('equal')
    
    common_opts = dict(
        plottype='qq',
        probax='x',
        problabel='Theoretical Quantiles',
        datalabel='Emperical Quantiles',
        scatter_kws=dict(label='Bill amounts')
    )
    
    fig = probscale.probplot(tips['total_bill'], ax=ax, dist=norm, **common_opts)
    
    equality_line(ax, label='Guessed Normal Distribution')
    ax.legend(loc='lower right')
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_20_0.png


Hmm. That doesn't look too good. Let's use scipy's fitting functionality
to try out a lognormal distribution.

.. code:: python

    lognorm_params = stats.lognorm.fit(tips['total_bill'], floc=0)
    lognorm = stats.lognorm(*lognorm_params)
    fig, ax = pyplot.subplots(figsize=(5, 5))
    ax.set_aspect('equal')
    
    fig = probscale.probplot(tips['total_bill'], ax=ax, dist=lognorm, **common_opts)
    
    equality_line(ax, label='Fit Lognormal Distribution')
    ax.legend(loc='lower right')
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_22_0.png


That's a little bit better.

Finding the best distribution is left as an exercise to the reader.

Best-fit lines
--------------

Adding a best-fit line to a probability plot can provide insight as to
whether or not a dataset can be characterized by a distribution.

This is simply done with the ``bestfit=True`` option in ``probplot``.
Behind the scenes, ``probplot`` transforms both the x- and y-data of fed
to the regression based on the plot type and scale of the data axis
(controlled via ``datascale``).

Visual attributes of the line can be controled with the ``line_kws``
parameter. If you want label the best-fit line, that is where you
specify its label.

Simple examples
~~~~~~~~~~~~~~~

The most trivial case is a P-P plot with a linear data axis

.. code:: python

    fig, ax = pyplot.subplots(figsize=(6, 3))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='pp', bestfit=True,
                             problabel='Percentile', datalabel='Total Bill (USD)',
                             scatter_kws=dict(label='Bill Amount'),
                             line_kws=dict(label='Best-fit line'))
    ax.legend(loc='upper left')
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_25_0.png


The least trivial case is a probability plot with a log-scaled data
axes.

As suggested by the section on quantile plots with custom distributions,
using a normal probability scale with a lognormal data scale provides a
decent fit (visually speaking).

Note that you still put the probability scale on either the x- or
y-axis.

.. code:: python

    fig, ax = pyplot.subplots(figsize=(4, 6))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='prob', probax='y', bestfit=True, 
                             datascale='log', problabel='Probabilities', datalabel='Total Bill (USD)',
                             scatter_kws=dict(label='Bill Amount'),
                             line_kws=dict(label='Best-fit line'))
    ax.legend(loc='upper left')
    ax.set_ylim(bottom=0.1, top=99.9)
    ax.set_xlim(left=1, right=100)
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_27_0.png


Bootstrapped confidence intervals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Regardless of the scales of the plot (linear, log, or prob), you can add
bootstrapped confidence intervals around the best-fit line. Simply use
the ``estimate_ci=True`` option along with ``bestfit=True``:

.. code:: python

    N = 15
    numpy.random.seed(0)
    x = numpy.random.normal(size=N) + numpy.random.uniform(size=N)
    fig, ax = pyplot.subplots(figsize=(8, 4))
    fig = probscale.probplot(x, ax=ax, bestfit=True, estimate_ci=True, 
                             line_kws={'label': 'BF Line', 'color': 'b'},
                             scatter_kws={'label': 'Observations'},
                             problabel='Probability (%)')
    ax.legend(loc='lower right')
    ax.set_ylim(bottom=-2, top=4)
    seaborn.despine(fig)


.. image:: closer_look_at_viz_files/output_29_0.png


Tuning the plotting positions
-----------------------------

The ``probplot`` function calls the :func:`viz.plot_plos` function to compute each dataset's plotting positions.

You should read that function's docstring for more detailed information.
But the high-level overview is that there are a couple of parameters (``alpha`` and ``beta``) that you can tweak in the plotting positions calculation.

The most common values can be selected via the ``postype`` parameter.

These are controlled via the `pp_kws` parameter in `probplot`
and are discussed in much more detail in the `next tutorial <closer_look_at_plot_pos.html>`_.

.. code:: python

    common_opts = dict(
        plottype='prob',
        probax='x',
        datalabel='Data',
    )
    
    numpy.random.seed(0)
    x = numpy.random.normal(size=15)
    
    fig, (ax1, ax2, ax3) = pyplot.subplots(figsize=(6, 6), nrows=3,
                                           sharey=True, sharex=True)
    fig = probscale.probplot(x, ax=ax1, problabel='Cunnuane (default) plotting positions',
                             **common_opts)
    
    fig = probscale.probplot(x, ax=ax2, problabel='Weibull plotting positions',
                             pp_kws=dict(postype='weibull'), **common_opts)
    
    fig = probscale.probplot(x, ax=ax3, problabel='Custom plotting positions',
                             pp_kws=dict(alpha=0.6, beta=0.1), **common_opts)
    ax1.set_xlim(left=1, right=99)
    seaborn.despine()
    fig.tight_layout()


.. image:: closer_look_at_viz_files/output_32_0.png


Controlling the aesthetics of the plot elements
-----------------------------------------------

As it has been hinted in the examples above, the ``probplot`` function
takes two dictionaries to customize the data series and the best-fit
line (``scatter_kws`` and ``line_kws``, respectively. These dictionaries
are passed directly to the ``plot`` method of current axes.

By default, the data series assumes that ``linestyle='none'`` and
``marker='o'``. These can be overwritten through ``scatter_kws``

Revisting the previous example, we can customize it like so:

.. code:: python

    scatter_options = dict(
        marker='^',
        markerfacecolor='none',
        markeredgecolor='firebrick',
        markeredgewidth=1.25,
        linestyle='none',
        alpha=0.35,
        zorder=5,
        label='Meal Cost ($)'
    )
    
    line_options = dict(
        dashes=(10,2,5,2,10,2),
        color='0.25',
        linewidth=3,
        zorder=10,
        label='Best-fit line'
    ) 
    
    fig, ax = pyplot.subplots(figsize=(4, 6))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='prob', probax='y', bestfit=True, 
                             datascale='log', problabel='Probabilities', datalabel='Total Bill (USD)',
                             scatter_kws=scatter_options, line_kws=line_options)
    ax.legend(loc='upper left')
    ax.set_ylim(bottom=0.1, top=99.9)
    seaborn.despine()


.. image:: closer_look_at_viz_files/output_34_0.png


.. note::
   The ``probplot`` function can take two additional aesthetic parameters:
   `color` and `label`. If provided, `color` will override the marker face color
   and line color options of the `scatter_kws` and `line_kws` parameters, respectively.
   Similarly, the label of the scatter series will be overridden by the explicit parameter.
   It is not recommended that `color` and `label` are used. They exist primarily for 
   compatibility with the seaborn package.

Mapping probability plots to seaborn `FacetGrids <http://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html#subsetting-data-with-facetgrid>`__
-----------------------------------------------------------------------------------------------------------------------------------------------------------

In general, ``probplot`` was written with ``FacetGrids`` in mind. All
you need to do is specify the data column and other options in the call
to ``FacetGrid.map``.

Unfortunately the labels don't work out exactly like I want, but it's a
work in progress.

.. code:: python

    fg = (
        seaborn.FacetGrid(data=iris, hue='species', aspect=2)
            .map(probscale.probplot, 'sepal_length')
            .set_axis_labels(x_var='Probability', y_var='Sepal Length')
            .add_legend()
    )


.. image:: closer_look_at_viz_files/output_37_0.png


.. code:: python

    fg = (
        seaborn.FacetGrid(data=iris, hue='species', aspect=2)
            .map(probscale.probplot, 'petal_length', plottype='qq', probax='y')
            .set_ylabels('Quantiles')
            .add_legend()
    )


.. image:: closer_look_at_viz_files/output_38_0.png


.. code:: python

    fg = (
        seaborn.FacetGrid(data=tips, hue='sex', row='smoker', col='time', margin_titles=True, size=4)
            .map(probscale.probplot, 'total_bill', probax='y', bestfit=True)
            .set_ylabels('Probability')
            .add_legend()
    )


.. image:: closer_look_at_viz_files/output_39_0.png