When it comes to data analysis in Python, the duo of NumPy and Pandas has long been the go-to solution for most data scientists and analysts. However, while these libraries are powerful, they have limitations, especially when it comes to performing advanced statistical analyses. Enter Pingouin – a simple yet highly versatile statistical package built on top of Pandas and NumPy.

Pingouin is designed to make complex statistical analyses as easy and intuitive as possible while maintaining high performance. In this post, we’ll dive deep into how Pingouin can boost your data analytics workflow, providing you with a streamlined alternative to traditional methods. We’ll also showcase examples and compare its performance.

Why Pingouin?

The Pingouin library shines when it comes to simplifying and enhancing statistical analysis tasks. It provides an intuitive interface to perform statistical tests, compute effect sizes, perform correlation analyses, and more – all with minimal code.

Here’s why Pingouin is gaining traction:

  • Ease of Use: Pingouin simplifies advanced statistical procedures, reducing the amount of code required.
  • Rich Features: It supports a variety of statistical tests (T-tests, ANOVA, correlations, etc.), with built-in correction methods and effect size measures.
  • Efficiency: Because it is built on top of Pandas and NumPy, Pingouin works directly on DataFrames and arrays, so there is no conversion step between your data and your statistics.
  • Publication-Ready Output: Pingouin provides outputs that are easy to interpret and publish directly in papers or reports.

Installation

To install Pingouin, simply run:

pip install pingouin
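
After installing, a quick sanity check can confirm that the package is visible to your interpreter. The sketch below uses only the standard library; the helper name installed_version is hypothetical and chosen just for this illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether Pingouin can be found in the current environment
version = installed_version("pingouin")
print(f"pingouin {version}" if version else "pingouin is not installed")
```

If this reports that the package is missing, the install most likely went into a different Python environment than the one running your script.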

Example 1: T-tests and Effect Sizes

One of Pingouin’s core functions is the t-test, but it doesn’t stop there: it also calculates effect sizes and confidence intervals and returns everything in a single, clean table.

Traditional Approach (Using SciPy)

Using SciPy to run a T-test looks like this:

from scipy.stats import ttest_ind
import numpy as np

# Sample data
group1 = np.random.normal(loc=5, scale=1.5, size=100)
group2 = np.random.normal(loc=6, scale=1.8, size=100)

# T-test
stat, pval = ttest_ind(group1, group2)
print(f"T-statistic: {stat}, p-value: {pval}")

With Pingouin

Here’s the same analysis, but with Pingouin. Note how much more informative and concise the output is.

import numpy as np
import pandas as pd
import pingouin as pg

# Creating a dataframe
df = pd.DataFrame({
    'group1': np.random.normal(loc=5, scale=1.5, size=100),
    'group2': np.random.normal(loc=6, scale=1.8, size=100)
})

# Running a T-test with Pingouin
t_test_results = pg.ttest(df['group1'], df['group2'], paired=False)
print(t_test_results)

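Pingouin returns a DataFrame that shows not only the T-statistic and p-value but also the degrees of freedom, a 95% confidence interval, and an effect size (Cohen’s d).

Output (abridged and illustrative; the data are generated without a random seed, so your values will differ, and depending on your Pingouin version the table may include further columns such as BF10 and power):

T      dof  p-val   cohen-d  CI95%
-3.45  198  0.0007  0.485    [-0.78, -0.19]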

This table is ready for publication and much more informative than the basic output you’d get from SciPy.
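
To make the extra columns less of a black box, here is a minimal sketch of how a pooled-variance t-statistic and Cohen’s d can be computed by hand with the standard library. It uses the textbook formulas; Pingouin’s own implementation may differ in details such as sign conventions or Welch’s correction, and the function names and sample values are hypothetical.

```python
import math
import statistics


def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances (n - 1)
    return math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))


def cohens_d(a, b):
    """Standardized mean difference: (mean_a - mean_b) / pooled SD."""
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd(a, b)


def student_t(a, b):
    """Student's t-statistic (equal variances assumed) and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    se = pooled_sd(a, b) * math.sqrt(1 / n1 + 1 / n2)
    return (statistics.mean(a) - statistics.mean(b)) / se, n1 + n2 - 2


# Small made-up samples, purely for illustration
group_a = [2.0, 4.0, 6.0, 8.0]
group_b = [1.0, 3.0, 5.0, 7.0]

d = cohens_d(group_a, group_b)
t, dof = student_t(group_a, group_b)
print(f"t = {t:.3f}, dof = {dof}, Cohen's d = {d:.3f}")
```

With SciPy alone you would need something like this on top of ttest_ind; with Pingouin the same quantities arrive in one table.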

Performance Comparison

Both calls run quickly on data of this size. The practical difference is that Pingouin returns confidence intervals and effect sizes in the same call, which makes it the better choice for deeper insights without having to write additional code.

Example 2: ANOVA and Post-Hoc Analysis

ANOVA tests are crucial when comparing more than two groups. Pingouin allows you to run one-way or repeated measures ANOVA, along with post-hoc analysis in just a few lines.

With Pingouin

Let’s say you want to compare pain thresholds across several groups, using the one-way ANOVA example dataset that ships with Pingouin:

# Load Pingouin's built-in example dataset
# (columns: 'Subject', 'Hair color', 'Pain threshold')
df = pg.read_dataset('anova')

# One-way ANOVA
anova_results = pg.anova(dv='Pain threshold', between='Hair color', data=df, detailed=True)
print(anova_results)

# Post-hoc test
posthoc_results = pg.pairwise_tests(dv='Pain threshold', between='Hair color', data=df, padjust='bonferroni')
print(posthoc_results)

Output (illustrative layout; the source label and the values depend on your data):

Source     SS     DF  MS     F     p-unc   np2
Condition  20.26  2   10.13  6.75  0.0023  0.108

Pingouin provides publication-ready results, including F-statistics, p-values, and effect sizes (partial eta-squared, the np2 column). This is much more efficient than calculating effect sizes by hand after running the test with a lower-level library such as SciPy.
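
For readers who want to see where the SS, F, and np2 columns come from, here is a minimal standard-library sketch of a one-way ANOVA. It follows the textbook decomposition into between-group and within-group sums of squares and is not Pingouin’s implementation; the function name and the sample data are hypothetical. In a one-way design, partial eta-squared equals eta-squared, which is what the sketch computes.

```python
import statistics


def one_way_anova(groups):
    """Return sums of squares, F, and eta-squared for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)

    # Between-group variability: group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: observations around their own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)

    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
        "np2": eta_sq,
    }


# Three small made-up groups, purely for illustration
result = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(result)
```

The p-value is left out because it requires the F distribution, which the standard library does not provide.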

Post-Hoc Analysis

Pingouin can correct for multiple comparisons (for example with Bonferroni or Holm adjustments), which is crucial when following up a significant ANOVA. The post-hoc test shows which specific pairs of groups differ significantly. The table here is simplified and illustrative; the real output typically has more columns, and the column names may differ between Pingouin versions.

Contrast      p-val   padj    effsize  CI95%
Group 1 vs 2  0.0012  0.0056  0.74     [0.39, 1.12]
Group 2 vs 3  0.0453  0.1360  0.52     [0.15, 0.98]
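
To show what a correction like the ones named above does to raw p-values, here is a minimal sketch of the Bonferroni and Holm adjustments in plain Python. These are the standard definitions rather than Pingouin’s code, and the function names and p-values are hypothetical.

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down adjustment, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Adjusted values must never decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up uncorrected p-values
print("Bonferroni:", bonferroni(raw))
print("Holm:      ", holm(raw))
```

Holm is never more conservative than Bonferroni, which is why it is often preferred when there are many pairwise comparisons.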

Performance Insight

Pingouin leans on Pandas for data handling and applies the post-hoc adjustment in the same call, which saves time compared to manually setting up multiple tests and corrections in other libraries.

Example 3: Correlation Analysis

Correlation analysis is another essential part of data analysis workflows, and Pingouin makes it easy to compute correlations, along with p-values, confidence intervals, and outlier detection.

Traditional Method (Using SciPy)

from scipy.stats import pearsonr

# Data
x = np.random.normal(size=100)
y = np.random.normal(size=100)

# Pearson correlation
corr, pval = pearsonr(x, y)
print(f"Pearson Correlation: {corr}, P-value: {pval}")

With Pingouin

# Pearson correlation with Pingouin
corr_results = pg.corr(x, y)
print(corr_results)

Pingouin provides additional insights such as confidence intervals and the Bayes factor for correlations, making it a more robust choice for analysts.

Output (illustrative; the data are generated without a random seed, so your values will differ):

n    r      CI95%          p-val  BF10   power
100  0.034  [-0.23, 0.27]  0.731  0.122  0.071

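The Bayes factor (BF10) quantifies how strongly the data support the alternative hypothesis relative to the null: values above 1 favor the alternative, while values below 1 (such as the 0.122 here) favor the null. This is something not readily available in most other Python libraries.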
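
As a rough illustration of what sits behind the r and CI95% columns, the sketch below computes a Pearson correlation and a confidence interval using the Fisher z-transformation, a standard approach. Whether Pingouin uses exactly this procedure is an assumption here, and the function names and data are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, confidence=0.95):
    """Confidence interval for r via the Fisher z-transformation (needs n > 3, |r| < 1)."""
    z = math.atanh(r)                      # transform r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)              # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)  # back-transform


# Made-up data, purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(x, y)
low, high = fisher_ci(r, len(x))
print(f"r = {r:.3f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only five observations the interval is very wide, which is a useful reminder that a single correlation coefficient says little without its uncertainty.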

Performance Benefits and Conclusion

Pingouin not only makes performing advanced statistical analyses easier but also enhances productivity by offering more comprehensive, ready-to-use outputs. From built-in correction methods to effect size calculations, Pingouin removes the need to use multiple packages, cutting down on development time and improving workflow efficiency.

By leveraging Pingouin, data scientists can focus on interpreting results rather than computing effect sizes, confidence intervals, and p-values by hand. Because it works directly with Pandas DataFrames and NumPy arrays, it slots into existing workflows without extra conversion steps.

If you’re looking for a powerful, easy-to-use Python library that enhances traditional data analytics workflows, Pingouin is an excellent choice. Its built-in features and well-structured outputs will make your statistical analyses both faster and more robust.
