A closer look at probability plots
==================================

Overview
--------

The ``probscale.probplot`` function lets you do a couple of things.
They are:

1. Creating percentile, quantile, or probability plots.
2. Placing your probability scale on either axis.
3. Specifying an arbitrary distribution for your probability scale.
4. Drawing a best-fit line in linear-probability or log-probability
   space.
5. Computing the plotting positions of your data any way you want.
6. Using probability axes on seaborn ``FacetGrids``.

We'll go over all of these options in this tutorial.

.. code:: python

    %matplotlib inline

.. code:: python

    import warnings
    warnings.simplefilter('ignore')

    import numpy
    from matplotlib import pyplot
    import seaborn

    import probscale
    clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
    seaborn.set(style='ticks', context='talk', color_codes=True, rc=clear_bkgd)

    # load up some example data from the seaborn package
    tips = seaborn.load_dataset("tips")
    iris = seaborn.load_dataset("iris")

Different plot types
--------------------

In general, there are three plot types:

1. Percentile, a.k.a. P-P plots
2. Quantile, a.k.a. Q-Q plots
3. Probability, a.k.a. Prob Plots

Percentile plots
~~~~~~~~~~~~~~~~

Percentile plots are the simplest plots. You simply plot the data
against their plotting positions. The plotting positions are shown on a
linear scale, but the data can be scaled as appropriate.

If you were doing that from scratch, it would look like this:

.. code:: python

    # plot_pos returns fractions, so scale them to percentiles
    position, bill = probscale.plot_pos(tips['total_bill'])
    position *= 100
    fig, ax = pyplot.subplots(figsize=(6, 3))
    ax.plot(position, bill, marker='.', linestyle='none', label='Bill amount')
    ax.set_xlabel('Percentile')
    ax.set_ylabel('Total Bill (USD)')
    ax.set_yscale('log')
    ax.set_ylim(bottom=1, top=100)
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_4_0.png

Using the ``probplot`` function with ``plottype='pp'``, it becomes:

.. code:: python

    fig, ax = pyplot.subplots(figsize=(6, 3))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='pp', datascale='log',
                             problabel='Percentile', datalabel='Total Bill (USD)',
                             scatter_kws=dict(marker='.', linestyle='none', label='Bill Amount'))
    ax.set_ylim(bottom=1, top=100)
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_6_0.png

Quantile plots
~~~~~~~~~~~~~~

Quantile plots are similar to probability plots. The main difference is
that plotting positions are converted into quantiles or
:math:`Z`-scores based on a probability distribution. The default
distribution is the standard-normal distribution. Using a different
distribution is covered further down.

Using the same dataset as above, let's make a quantile plot. Like
above, we'll do it from scratch and then using ``probplot``.

.. code:: python

    from scipy import stats

    position, bill = probscale.plot_pos(tips['total_bill'])
    # convert the plotting positions to standard-normal quantiles
    quantile = stats.norm.ppf(position)

    fig, ax = pyplot.subplots(figsize=(6, 3))
    ax.plot(quantile, bill, marker='.', linestyle='none', label='Bill amount')
    ax.set_xlabel('Normal Quantiles')
    ax.set_ylabel('Total Bill (USD)')
    ax.set_yscale('log')
    ax.set_ylim(bottom=1, top=100)
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_8_0.png

Using ``probplot``:

.. code:: python

    fig, ax = pyplot.subplots(figsize=(6, 3))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='qq', datascale='log',
                             problabel='Standard Normal Quantiles', datalabel='Total Bill (USD)',
                             scatter_kws=dict(marker='.', linestyle='none', label='Bill Amount'))
    ax.set_ylim(bottom=1, top=100)
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_10_0.png

You'll notice that the shape of the data is straighter on the Q-Q plot
than the P-P plot. This is due to the transformation that takes place
when converting the plotting positions to a distribution's quantiles.
The plot below hopefully illustrates this more clearly.
Additionally, we'll show how to use the ``probax`` option to flip the
plot so that the P-P/Q-Q/Probability axis is on the y-axis.

.. code:: python

    fig, (ax1, ax2) = pyplot.subplots(figsize=(6, 6), ncols=2, sharex=True)
    markers = dict(marker='.', linestyle='none', label='Bill Amount')

    fig = probscale.probplot(tips['total_bill'], ax=ax1, plottype='pp', probax='y',
                             datascale='log', problabel='Percentiles',
                             datalabel='Total Bill (USD)', scatter_kws=markers)

    fig = probscale.probplot(tips['total_bill'], ax=ax2, plottype='qq', probax='y',
                             datascale='log', problabel='Standard Normal Quantiles',
                             datalabel='Total Bill (USD)', scatter_kws=markers)

    ax1.set_xlim(left=1, right=100)
    fig.tight_layout()
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_12_0.png

In the case of P-P plots and simple Q-Q plots, the ``probplot``
function doesn't offer much convenience compared to writing raw
matplotlib commands. However, this changes when you start making
probability plots and using more advanced options.

Probability plots
~~~~~~~~~~~~~~~~~

Visually, the curve of plots on probability and quantile scales should
be the same. The difference is that the axis ticks are placed and
labeled based on non-exceedance probabilities rather than the more
abstract quantiles of the distribution.

Unsurprisingly, a picture explains this much better. Let's build off of
the previous plot:

.. code:: python

    fig, (ax1, ax2, ax3) = pyplot.subplots(figsize=(9, 6), ncols=3, sharex=True)
    common_opts = dict(
        probax='y',
        datascale='log',
        datalabel='Total Bill (USD)',
        scatter_kws=dict(marker='.', linestyle='none')
    )

    fig = probscale.probplot(tips['total_bill'], ax=ax1, plottype='pp',
                             problabel='Percentiles', **common_opts)

    fig = probscale.probplot(tips['total_bill'], ax=ax2, plottype='qq',
                             problabel='Standard Normal Quantiles', **common_opts)

    fig = probscale.probplot(tips['total_bill'], ax=ax3, plottype='prob',
                             problabel='Standard Normal Probabilities', **common_opts)

    ax3.set_xlim(left=1, right=100)
    ax3.set_ylim(bottom=0.13, top=99.87)
    fig.tight_layout()
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_14_0.png

Visually, the shapes of the curves on the two right-most plots are
identical. The difference is that the y-axis ticks and labels of the
probability plot are more "human" readable.

In other words, the probability (right) axis gives us the ease of
finding, e.g., the 75th percentile that the percentile (left) axis
offers, and illustrates how well the data fit a given distribution like
the quantile (middle) axis does.

Using different distributions for your scales
---------------------------------------------

When using quantile or probability scales, you can pass a distribution
from the ``scipy.stats`` module to the ``probplot`` function. When a
distribution is not provided to the ``dist`` parameter, a standard
normal distribution is used.

.. code:: python

    common_opts = dict(
        plottype='prob',
        probax='y',
        datascale='log',
        datalabel='Total Bill (USD)',
        scatter_kws=dict(marker='+', linestyle='none', mew=1)
    )

    # frozen scipy.stats distributions with fixed shape parameters
    alpha = stats.alpha(10)
    beta = stats.beta(6, 3)

    fig, (ax1, ax2, ax3) = pyplot.subplots(figsize=(9, 6), ncols=3, sharex=True)
    fig = probscale.probplot(tips['total_bill'], ax=ax1, dist=alpha,
                             problabel='Alpha(10) Probabilities', **common_opts)

    fig = probscale.probplot(tips['total_bill'], ax=ax2, dist=beta,
                             problabel='Beta(6, 3) Probabilities', **common_opts)

    fig = probscale.probplot(tips['total_bill'], ax=ax3, dist=None,
                             problabel='Standard Normal Probabilities', **common_opts)

    ax3.set_xlim(left=1, right=100)
    for ax in [ax1, ax2, ax3]:
        ax.set_ylim(bottom=0.2, top=99.8)
    seaborn.despine()
    fig.tight_layout()

.. image:: closer_look_at_viz_files/output_16_0.png

This can also be done for Q-Q scales:

.. code:: python

    common_opts = dict(
        plottype='qq',
        probax='y',
        datascale='log',
        datalabel='Total Bill (USD)',
        scatter_kws=dict(marker='+', linestyle='none', mew=1)
    )

    alpha = stats.alpha(10)
    beta = stats.beta(6, 3)

    fig, (ax1, ax2, ax3) = pyplot.subplots(figsize=(9, 6), ncols=3, sharex=True)
    fig = probscale.probplot(tips['total_bill'], ax=ax1, dist=alpha,
                             problabel='Alpha(10) Quantiles', **common_opts)

    fig = probscale.probplot(tips['total_bill'], ax=ax2, dist=beta,
                             problabel='Beta(6, 3) Quantiles', **common_opts)

    fig = probscale.probplot(tips['total_bill'], ax=ax3, dist=None,
                             problabel='Standard Normal Quantiles', **common_opts)

    ax1.set_xlim(left=1, right=100)
    seaborn.despine()
    fig.tight_layout()

.. image:: closer_look_at_viz_files/output_18_0.png

Using a specific distribution with a quantile scale can give us an idea
of how well the data fit that distribution. For instance, let's say we
have a hunch that the values of the ``total_bill`` column in our
dataset are normally distributed and their mean and standard deviation
are 21 and 8, respectively.
We could investigate that by creating a ``scipy.stats.norm``
distribution with those parameters and using that distribution in the
Q-Q plot.

.. code:: python

    def equality_line(ax, label=None):
        # draw a 1:1 line spanning the combined x- and y-limits
        limits = [
            numpy.min([ax.get_xlim(), ax.get_ylim()]),
            numpy.max([ax.get_xlim(), ax.get_ylim()]),
        ]
        ax.set_xlim(limits)
        ax.set_ylim(limits)
        ax.plot(limits, limits, 'k-', alpha=0.75, zorder=0, label=label)

    norm = stats.norm(loc=21, scale=8)
    fig, ax = pyplot.subplots(figsize=(5, 5))
    ax.set_aspect('equal')

    common_opts = dict(
        plottype='qq',
        probax='x',
        problabel='Theoretical Quantiles',
        datalabel='Empirical Quantiles',
        scatter_kws=dict(label='Bill amounts')
    )

    fig = probscale.probplot(tips['total_bill'], ax=ax, dist=norm, **common_opts)

    equality_line(ax, label='Guessed Normal Distribution')
    ax.legend(loc='lower right')
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_20_0.png

Hmm. That doesn't look too good. Let's use scipy's fitting
functionality to try out a lognormal distribution.

.. code:: python

    # fit the shape and scale while pinning the location at zero
    lognorm_params = stats.lognorm.fit(tips['total_bill'], floc=0)
    lognorm = stats.lognorm(*lognorm_params)
    fig, ax = pyplot.subplots(figsize=(5, 5))
    ax.set_aspect('equal')

    fig = probscale.probplot(tips['total_bill'], ax=ax, dist=lognorm, **common_opts)

    equality_line(ax, label='Fit Lognormal Distribution')
    ax.legend(loc='lower right')
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_22_0.png

That's a little bit better.

Finding the best distribution is left as an exercise to the reader.

Best-fit lines
--------------

Adding a best-fit line to a probability plot can provide insight as to
whether or not a dataset can be characterized by a distribution.

This is simply done with the ``bestfit=True`` option in ``probplot``.
Behind the scenes, ``probplot`` transforms both the x- and y-data fed
to the regression based on the plot type and the scale of the data axis
(controlled via ``datascale``).

Visual attributes of the line can be controlled with the ``line_kws``
parameter.
If you want to label the best-fit line, that is where you specify its
label.

Simple examples
~~~~~~~~~~~~~~~

The most trivial case is a P-P plot with a linear data axis:

.. code:: python

    fig, ax = pyplot.subplots(figsize=(6, 3))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='pp', bestfit=True,
                             problabel='Percentile', datalabel='Total Bill (USD)',
                             scatter_kws=dict(label='Bill Amount'),
                             line_kws=dict(label='Best-fit line'))
    ax.legend(loc='upper left')
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_25_0.png

The least trivial case is a probability plot with a log-scaled data
axis.

As suggested by the section on quantile plots with custom
distributions, using a normal probability scale with a log-scaled data
axis provides a decent fit (visually speaking).

Note that you can still put the probability scale on either the x- or
y-axis.

.. code:: python

    fig, ax = pyplot.subplots(figsize=(4, 6))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='prob', probax='y', bestfit=True,
                             datascale='log', problabel='Probabilities', datalabel='Total Bill (USD)',
                             scatter_kws=dict(label='Bill Amount'),
                             line_kws=dict(label='Best-fit line'))
    ax.legend(loc='upper left')
    ax.set_ylim(bottom=0.1, top=99.9)
    ax.set_xlim(left=1, right=100)
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_27_0.png

Bootstrapped confidence intervals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Regardless of the scales of the plot (linear, log, or prob), you can
add bootstrapped confidence intervals around the best-fit line. Simply
use the ``estimate_ci=True`` option along with ``bestfit=True``:

.. code:: python

    N = 15
    numpy.random.seed(0)
    x = numpy.random.normal(size=N) + numpy.random.uniform(size=N)
    fig, ax = pyplot.subplots(figsize=(8, 4))
    fig = probscale.probplot(x, ax=ax, bestfit=True, estimate_ci=True,
                             line_kws={'label': 'BF Line', 'color': 'b'},
                             scatter_kws={'label': 'Observations'},
                             problabel='Probability (%)')
    ax.legend(loc='lower right')
    ax.set_ylim(bottom=-2, top=4)
    seaborn.despine(fig)

.. image:: closer_look_at_viz_files/output_29_0.png

Tuning the plotting positions
-----------------------------

The ``probplot`` function calls the :func:`viz.plot_pos` function to
compute each dataset's plotting positions.

You should read that function's docstring for more detailed
information. The high-level overview is that there are a couple of
parameters (``alpha`` and ``beta``) that you can tweak in the plotting
position calculation.

The most common values can be selected via the ``postype`` parameter.

These are controlled via the ``pp_kws`` parameter in ``probplot`` and
are discussed in much more detail in the next tutorial.

.. code:: python

    common_opts = dict(
        plottype='prob',
        probax='x',
        datalabel='Data',
    )

    numpy.random.seed(0)
    x = numpy.random.normal(size=15)

    fig, (ax1, ax2, ax3) = pyplot.subplots(figsize=(6, 6), nrows=3,
                                           sharey=True, sharex=True)
    fig = probscale.probplot(x, ax=ax1, problabel='Cunnane (default) plotting positions',
                             **common_opts)

    fig = probscale.probplot(x, ax=ax2, problabel='Weibull plotting positions',
                             pp_kws=dict(postype='weibull'), **common_opts)

    fig = probscale.probplot(x, ax=ax3, problabel='Custom plotting positions',
                             pp_kws=dict(alpha=0.6, beta=0.1), **common_opts)
    ax1.set_xlim(left=1, right=99)
    seaborn.despine()
    fig.tight_layout()

.. image:: closer_look_at_viz_files/output_32_0.png

Controlling the aesthetics of the plot elements
-----------------------------------------------

As hinted in the examples above, the ``probplot`` function takes two
dictionaries to customize the data series and the best-fit line
(``scatter_kws`` and ``line_kws``, respectively). These dictionaries
are passed directly to the ``plot`` method of the current axes.

By default, the data series assumes that ``linestyle='none'`` and
``marker='o'``. These can be overridden through ``scatter_kws``.

Revisiting the previous example, we can customize it like so:

.. code:: python

    scatter_options = dict(
        marker='^',
        markerfacecolor='none',
        markeredgecolor='firebrick',
        markeredgewidth=1.25,
        linestyle='none',
        alpha=0.35,
        zorder=5,
        label='Meal Cost ($)'
    )

    line_options = dict(
        dashes=(10,2,5,2,10,2),
        color='0.25',
        linewidth=3,
        zorder=10,
        label='Best-fit line'
    )

    fig, ax = pyplot.subplots(figsize=(4, 6))
    fig = probscale.probplot(tips['total_bill'], ax=ax, plottype='prob', probax='y', bestfit=True,
                             datascale='log', problabel='Probabilities', datalabel='Total Bill (USD)',
                             scatter_kws=scatter_options, line_kws=line_options)
    ax.legend(loc='upper left')
    ax.set_ylim(bottom=0.1, top=99.9)
    seaborn.despine()

.. image:: closer_look_at_viz_files/output_34_0.png

.. note::
   The ``probplot`` function can take two additional aesthetic
   parameters: ``color`` and ``label``. If provided, ``color`` will
   override the marker face color and line color options of the
   ``scatter_kws`` and ``line_kws`` parameters, respectively.
   Similarly, the label of the scatter series will be overridden by the
   explicit parameter. Using ``color`` and ``label`` directly is not
   recommended. They exist primarily for compatibility with the seaborn
   package.

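The order of precedence described in this section (built-in defaults, then ``scatter_kws``, then the explicit ``color`` and ``label`` parameters) can be sketched in plain Python. The helper ``resolve_scatter_options`` below is hypothetical and exists only for illustration. It is not how ``probscale`` is implemented internally, and mapping ``color`` onto ``markerfacecolor`` is an assumption based on the wording of the note.

```python
def resolve_scatter_options(scatter_kws=None, color=None, label=None):
    """Hypothetical helper: merge marker options, later sources win."""
    # defaults stated in the text for the data series
    options = {"linestyle": "none", "marker": "o"}

    # anything in the user's dictionary replaces the defaults
    options.update(scatter_kws or {})

    # explicit parameters, when given, replace the dictionary entries
    # (assumption: ``color`` maps onto the marker face color)
    if color is not None:
        options["markerfacecolor"] = color
    if label is not None:
        options["label"] = label
    return options


user_kws = {"marker": "^", "markerfacecolor": "none", "label": "Meal Cost ($)"}

# the dictionary alone: its marker and label are kept
print(resolve_scatter_options(user_kws))

# explicit parameters take over the face color and the label
print(resolve_scatter_options(user_kws, color="b", label="Bill amount"))
```

Keeping all styling in the two dictionaries avoids this last override step entirely, which is one reason the explicit parameters are best left to seaborn.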
Mapping probability plots to seaborn FacetGrids
-----------------------------------------------

In general, ``probplot`` was written with ``FacetGrids`` in mind. All
you need to do is specify the data column and other options in the call
to ``FacetGrid.map``.

Unfortunately the labels don't work out exactly like I want, but it's a
work in progress.

.. code:: python

    fg = (
        seaborn.FacetGrid(data=iris, hue='species', aspect=2)
            .map(probscale.probplot, 'sepal_length')
            .set_axis_labels(x_var='Probability', y_var='Sepal Length')
            .add_legend()
    )

.. image:: closer_look_at_viz_files/output_37_0.png

.. code:: python

    fg = (
        seaborn.FacetGrid(data=iris, hue='species', aspect=2)
            .map(probscale.probplot, 'petal_length', plottype='qq', probax='y')
            .set_ylabels('Quantiles')
            .add_legend()
    )

.. image:: closer_look_at_viz_files/output_38_0.png

.. code:: python

    # note: the ``size`` argument was renamed to ``height`` in seaborn 0.9
    fg = (
        seaborn.FacetGrid(data=tips, hue='sex', row='smoker', col='time',
                          margin_titles=True, height=4)
            .map(probscale.probplot, 'total_bill', probax='y', bestfit=True)
            .set_ylabels('Probability')
            .add_legend()
    )

.. image:: closer_look_at_viz_files/output_39_0.png
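To make the calling convention behind ``FacetGrid.map`` more concrete, here is a rough plain-Python sketch of what happens for each ``hue`` level: the named column is passed positionally, and ``color`` and ``label`` arrive as keyword arguments alongside any extra options. Both ``map_by_hue`` and ``fake_probplot`` are hypothetical stand-ins written for this illustration. They are not seaborn or ``probscale`` code, and the real ``FacetGrid.map`` also handles rows, columns, and the active axes.

```python
from collections import defaultdict


def map_by_hue(rows, hue, column, func, palette=("C0", "C1", "C2"), **kwargs):
    """Hypothetical stand-in for the per-hue loop of ``FacetGrid.map``."""
    # split the values of ``column`` by hue level, keeping first-seen order
    groups = defaultdict(list)
    for row in rows:
        groups[row[hue]].append(row[column])

    results = []
    for color, (level, values) in zip(palette, groups.items()):
        # one call per hue level: data go in positionally, while
        # ``color`` and ``label`` are supplied as keyword arguments
        results.append(func(values, color=color, label=level, **kwargs))
    return results


def fake_probplot(data, color=None, label=None, **kwargs):
    """Stand-in plotting function that just records what it was handed."""
    return {"n": len(data), "color": color, "label": label, **kwargs}


rows = [
    {"species": "setosa", "sepal_length": 5.1},
    {"species": "setosa", "sepal_length": 4.9},
    {"species": "virginica", "sepal_length": 6.3},
]

calls = map_by_hue(rows, "species", "sepal_length", fake_probplot, probax="y")
for call in calls:
    print(call)
```

Seen this way, the ``color`` and ``label`` parameters of ``probplot`` are simply what lets it act as the mapped function in a grid like the ones above.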