Reading large files in R

Reading large data files into R using read.table can be painfully slow due to the large number of checks it performs on the data (inferring data types, checking the number of columns, etc.). Below are a number of tips to improve its speed:

  1. Set nrows to the number of records in your dataset. An easy way to obtain this is to run wc -l in the terminal (only works on Unix systems); a handy function to do this from R is included in the sketch after this list.
  2. Set comment.char="" to turn off the interpretation of comments.
  3. Supply the column types in colClasses, e.g. colClasses = c("numeric", "character", "logical").
  4. If you are using scan directly, setting multi.line=FALSE can also help the performance of the scan.
  5. In general, it is worth using as.is=TRUE to prevent strings being turned into factors automatically. If you provide colClasses this is redundant (as the column types will be determined by colClasses).
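
Putting these tips together, below is a rough sketch. The file name, separator and column types are purely illustrative, and the helper simply wraps wc -l (so, as noted above, it only works on Unix-like systems).

    # Count the number of lines in a file by shelling out to wc -l (Unix only)
    count_rows <- function(path) {
      out <- system(paste("wc -l", shQuote(path)), intern = TRUE)
      as.integer(sub("^\\s*(\\d+).*", "\\1", out))  # keep only the leading count
    }

    n <- count_rows("sales.csv")   # includes the header line, but a slight
                                   # overestimate of nrows is harmless

    sales <- read.table("sales.csv",
                        header           = TRUE,
                        sep              = ",",
                        nrows            = n,
                        comment.char     = "",
                        colClasses       = c("numeric", "character", "logical"),
                        stringsAsFactors = FALSE)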

If your data is very large (5GB+), you should look at using the data.table package. In particular, fread can be used to read very large files into data tables.
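
A minimal sketch with fread, again using a placeholder file name; fread infers the separator and column types from a sample of the file, and the types can still be given explicitly if you know them:

    library(data.table)

    # Reads the file into a data.table; much faster than read.table on large files
    sales <- fread("sales.csv")

    # Column types can also be supplied explicitly, as with read.table
    sales <- fread("sales.csv", colClasses = c("numeric", "character", "logical"))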

 

Significance Testing and Sample Size

A common question when analysing any sort of online data:

Is this result statistically significant?

Examples of where I have had to answer  this question:

  • A/B Testing
  • Survey Analysis

In this post, we will explore ways to answer this question and determine the sample sizes we need.

Note: This post was heavily influenced by Marketing Distillery's excellent  A/B Tests in Marketing.

Terminology

First let us define some standard statistical terminology:

  • We are deciding between two hypotheses - the null hypothesis and the alternative hypothesis.
  • The null hypothesis is the default assumption that there is no relationship between two values.  For instance, if comparing conversion rates of two landing pages, the null hypothesis is that the conversion rates are the same.
  • The alternative hypothesis is what we assume if the null hypothesis is rejected. It can be two-tailed, the conversion rates are not the same, or one-tailed, where we state the direction of the inequality.
  • A result is determined to be statistically significant if the observed effect was unlikely to have been seen by random chance.
  • How unlikely is determined by the significance level. Commonly it is set to 5%. It is the probability of rejecting the null hypothesis when it is true. This probability is denoted by \alpha

\alpha = P(rejecting\ null\ hypothesis\ |\ null\ hypothesis\ is\ true)

  • Rejecting the null hypothesis when the null hypothesis is true, is called a type I error.
  • A type II error occurs when the null hypothesis is false, but we erroneously fail to reject it. The probability of this is denoted by \beta

\beta = P(fail\ to\ reject\ null\ hypothesis\ |\ null\ hypothesis\ is\ false)

                                  Null hypothesis (H0) is true       Null hypothesis (H0) is false
Reject null hypothesis            Type I error (false positive)      Correct outcome (true positive)
Fail to reject null hypothesis    Correct outcome (true negative)    Type II error (false negative)
  • The power of a test is 1-\beta. Commonly this is set to 90%.
  • The p-value is calculated from the test statistic. It is the probability that the observed result would have been seen by chance assuming the null hypothesis is true.
  • To reject the null hypothesis we need a p-value that is lower than the selected significance level.

Types of Data

Usually we have two types of data we want to perform significance tests with:

  1. Proportions e.g. Conversion Rates, Percentages
  2. Real valued numbers e.g. Average Order Values, Average Time on Page

In this post, we will look at both.

Tests for Proportions

A common scenario is we have run an A/B test of two landing pages and we wish to test if the conversion rate is significantly different between the two.

An important assumption here is that the two groups are mutually exclusive, e.g. each user can only have been shown one of the landing pages.

The null hypothesis is that the proportions are equal in both groups.

The test can be performed in R using prop.test:
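
For example, with made-up counts for two landing pages A and B, each shown to 2,000 visitors (the numbers are purely illustrative):

    # Rows are the two landing pages; columns are conversions and non-conversions
    conversions <- matrix(c(100, 1900,    # page A: 100 conversions out of 2000
                            115, 1885),   # page B: 115 conversions out of 2000
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("A", "B"),
                                          c("converted", "not converted")))

    prop.test(x = conversions)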

Under the hood this is performing a Pearson's Chi-Squared Test.

Regarding the parameters:

x - Usually a 2x2 matrix giving the counts of successes and failures in each group (conversions and non-conversions, for instance).

n - The number of trials performed in each group. Can be left NULL if you provide x as a matrix.

p - Only if you want to test for equality against a specific proportion (e.g. 0.5).

alternative - Generally only used when testing against a specific p. Changes the underlying test to use a z-test. See Two-Tailed Test of Population Proportion for more details. The z-test may not be a good assumption with small sample sizes.

conf.level - Used only in the calculation of a confidence interval. Not used as part of the actual test.

correct - Usually safe to apply continuity correction. See my previous post for more details on the correction.

The important part of the output of prop.test is the p-value. If this is less than your desired significance level (say 0.05) you reject the null hypothesis.

In this example, the p-value is not less than our desired significance level (0.05) so we cannot reject the null hypothesis.

Before running your test, you should fix the sample size in each group in advance. The R library pwr has various functions to help with this; for proportions, pwr.2p.test can be used:
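
For example, leaving n blank so that the function solves for the per-group sample size (the effect size h = 0.05 here is only an illustrative value; how to choose it is covered below):

    library(pwr)

    pwr.2p.test(h = 0.05,          # minimum effect size we want to detect (illustrative)
                sig.level = 0.05,  # significance level (alpha)
                power = 0.9)       # power (1 - beta)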

Any one of the parameters can be left blank and the function will estimate its value. For instance, leaving n, the sample size, blank will mean the function will compute the desired sample size.

The only new parameter here is h. This is the minimum effect size you wish to be able to detect. h is calculated as the difference of the arcsine transformation of two proportions:

h = 2 \arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})

Assuming you have an idea of the base rate proportion (e.g. your current conversion rate) and the minimum change you want to detect, you can use the following R code to calculate h:
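
The pwr package provides ES.h for exactly this transformation. As an illustration, suppose the current conversion rate is 4% and the smallest uplift we care about is one percentage point:

    library(pwr)

    # Illustrative proportions: 4% base rate, 5% after the smallest change we care about
    h <- ES.h(p1 = 0.05, p2 = 0.04)
    h

    # Feed h back into the sample size calculation
    pwr.2p.test(h = h, sig.level = 0.05, power = 0.9)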

Tests for Real-Valued Data

Now imagine we want to test if a change to an eCommerce web page increases the average order value.

The major assumption we are going to make is that the data we are analysing is normally distributed. See my previous post on how to check if the data is normally distributed.

It may be possible to transform the data to a normal distribution, for instance if the data is log-normal. Time on page often looks to fit a log-normal distribution; in this case you can simply take the log of the times on page.

The test we will run is the two-sample t-test. We are testing if the means of the two groups are significantly different.

The parameters:

x - The samples in one group

y - The samples in the other group. Leave NULL for a single sample test

alternative - Perform two-tailed or one-tailed test?

mu - The true value of the mean (or of the difference in means for a two-sample test) under the null hypothesis

paired - TRUE for the paired t-test, used for data where the two samples have a natural partner in each set, for instance comparing the weights of people before and after a diet.

var.equal - Assume equal variance in the two groups or not

conf.level - Used in calculating the confidence interval around the means. Not part of the actual test.

The test can be run using t.test:
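
For example, with simulated order values for the two pages. Both groups are drawn from the same distribution here, purely for illustration, so we would not expect a significant difference:

    set.seed(42)

    # Simulated average-order-value data for the two groups (illustrative only)
    orders_a <- rnorm(200, mean = 50, sd = 10)
    orders_b <- rnorm(200, mean = 50, sd = 10)

    t.test(x = orders_a, y = orders_b,
           alternative = "two.sided",
           var.equal = FALSE)   # Welch's t-test; does not assume equal variances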

In this example, the p-value is greater than the 0.05 significance level, so we cannot reject the null hypothesis.

As before, we should set the required sample size before running the experiment. From the pwr library we can use pwr.t.test:
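
For example, leaving n blank so the function solves for the per-group sample size (the effect size d = 0.2 here is only illustrative; choosing d is discussed below):

    library(pwr)

    pwr.t.test(d = 0.2,              # minimum effect size we want to detect (illustrative)
               sig.level = 0.05,     # significance level (alpha)
               power = 0.9,          # power (1 - beta)
               type = "two.sample",
               alternative = "two.sided")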

d is the effect size. For t-tests the effect size is assessed as:

d = \frac{|\mu_1 - \mu_2|}{\sigma}

where \mu_1 is the mean of group 1, \mu_2 is the mean of group 2 and \sigma is the pooled standard deviation.

Ideally you should set d based on your problem domain, quantifying the effect size you expect to see. If this is not possible, Cohen's book on power analysis, "Statistical Power Analysis for the Behavioral Sciences", suggests setting d to 0.2, 0.5 and 0.8 for small, medium and large effect sizes respectively.
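
If you have pilot data from the two groups, a rough sketch of estimating d directly (the pilot vectors below are hypothetical):

    # Hypothetical pilot samples of order values from the two groups
    pilot_a <- c(48.2, 51.7, 49.9, 55.1, 47.3, 52.8, 50.4, 53.6)
    pilot_b <- c(52.4, 56.1, 54.9, 51.2, 58.3, 53.7, 55.5, 57.0)

    # Pooled standard deviation, then Cohen's d as defined above
    n1 <- length(pilot_a)
    n2 <- length(pilot_b)
    pooled_sd <- sqrt(((n1 - 1) * var(pilot_a) + (n2 - 1) * var(pilot_b)) / (n1 + n2 - 2))

    d <- abs(mean(pilot_a) - mean(pilot_b)) / pooled_sd
    d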

Testing for Normality

The most widely used distribution in statistical analysis is the normal distribution.  Often when performing some sort of data analysis, we need to answer the question:

Does this data sample look normally distributed?

There are a variety of statistical tests that can help us answer this question; however, we should first try to visualise the data.

Visualisation

Below is a simple piece of R code to create a histogram and normal Q-Q plot:
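
A minimal sketch of what that code might look like; the sample is generated from a normal distribution here, as in the original example:

    set.seed(1)
    samp <- rnorm(500, mean = 10, sd = 2)   # randomly generated normal data

    par(mfrow = c(1, 2))

    # Histogram with the normal density (sample mean and sd) overlaid
    hist(samp, freq = FALSE, main = "Histogram", xlab = "Sample value")
    curve(dnorm(x, mean = mean(samp), sd = sd(samp)),
          add = TRUE, col = "red", lwd = 2)

    # Normal Q-Q plot of the sample against theoretical normal quantiles
    qqnorm(samp)
    qqline(samp, col = "red")

    par(mfrow = c(1, 1))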

(Figure: histogram of the sample with the normal density overlaid, alongside a normal Q-Q plot.)

On top of the histogram we have overlaid the normal probability density function, using the sample mean and standard deviation.

In the Q-Q plot, we are plotting the quantiles of the sample data against the quantiles of a standard normal distribution. We are looking to see a roughly linear relationship between the sample and the theoretical quantiles.

In this example, we would probably conclude, based on the graphs, that the data is roughly normally distributed (which is true in this case as we randomly generated it from a normal distribution).

It is worth bearing in mind that many data analysis techniques assume normality (linear regression, PCA, etc.) but are still useful even with some deviations from normality.

Statistical Tests for Normality

We can also use several tests for normality. The two most common are the Anderson-Darling test and the Shapiro-Wilk test.

The Null Hypothesis of both these tests is that the data is normally distributed.

Anderson-Darling Test

The test is based on the distance between the empirical distribution function (EDF) and the cumulative distribution function (CDF) of the underlying distribution (e.g. Normal). The statistic is the weighted sum of the difference between the EDF and CDF, with more weight applied at the tails (making it better at detecting non-normality at the tails).

Below is the R code to perform this test:
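
A sketch using ad.test from the nortest package, applied here to a simulated normal sample so that the example is self-contained:

    library(nortest)

    set.seed(1)
    samp <- rnorm(500, mean = 10, sd = 2)   # simulated normal sample

    ad.test(samp)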

As the p-value is above the significance level, we fail to reject the null hypothesis, i.e. the data is consistent with normality.

Shapiro-Wilk Test

The Shapiro-Wilk test is a regression-type test that uses the correlation of sample order statistics (the sample values arranged in ascending order) with those of a normal distribution. Below is the R code to perform the test:
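
A sketch using the built-in shapiro.test, on the same kind of simulated normal sample:

    set.seed(1)
    samp <- rnorm(500, mean = 10, sd = 2)   # simulated normal sample

    shapiro.test(samp)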

Again, as the p-value is above the significance level, we fail to reject the null hypothesis.

The Shapiro-Wilk test is slightly more powerful than the Anderson-Darling test, but is limited to 5,000 samples.

Test Caution

While these tests can be useful, they are not infallible. I would recommend looking at the histogram and Q-Q plots first, and use the tests as another check.

In particular, very small and very large sample sizes can cause problems for the tests.

With a sample size of 10 or less, the test is unlikely to detect non-normality (i.e. reject the null hypothesis) even if the distribution is truly non-normal.

With a large sample size of 1,000 or more, a small deviation from normality (some noise in the sample) may be flagged as significant, causing the test to reject the null hypothesis.

Overall, your two best tools are the histogram and the Q-Q plot, with the statistical tests used as additional indicators.