Gaining a First Impression of Your Data

This article covers some methods for inspecting a dataset when you first get it.

Basic Methods

This example uses the built-in dataset PlantGrowth. The quickest way to gain a first impression of the data is to print it by calling PlantGrowth directly.

PlantGrowth

   weight group
1    4.17  ctrl
2    5.58  ctrl
3    5.18  ctrl
4    6.11  ctrl
5    4.50  ctrl
6    4.61  ctrl
7    5.17  ctrl
8    4.53  ctrl
9    5.33  ctrl
10   5.14  ctrl
11   4.81  trt1
12   4.17  trt1
13   4.41  trt1
14   3.59  trt1
15   5.87  trt1
16   3.83  trt1
17   6.03  trt1
18   4.89  trt1
19   4.32  trt1
20   4.69  trt1
21   6.31  trt2
22   5.12  trt2
23   5.54  trt2
24   5.50  trt2
25   5.37  trt2
26   5.29  trt2
27   4.92  trt2
28   6.15  trt2
29   5.80  trt2
30   5.26  trt2

We begin with the conventional approach: loading the data and saving it as a data.frame.

df <- PlantGrowth

We can print some basic information about this data.frame using the str() function: its dimensions, plus the type and first few values of each column.

str(df)

'data.frame':   30 obs. of  2 variables:
 $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
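
The pieces of information that str() reports can also be queried one at a time. The following is a minimal sketch using only base R functions; the variable name df is the same one used above.

```r
# Load the built-in dataset into a data.frame
df <- PlantGrowth

dim(df)            # number of rows and columns: 30 2
names(df)          # column names: "weight" "group"
sapply(df, class)  # class of each column: "numeric" and "factor"
levels(df$group)   # factor levels: "ctrl" "trt1" "trt2"
```

Querying the pieces separately is handy when a script needs a single value, such as the number of rows, rather than a printed overview.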

summary() gives a more comprehensive view of the data: the quartiles and the mean for numeric columns, and the number of observations per level for factors.

summary(df)

     weight       group   
 Min.   :3.590   ctrl:10  
 1st Qu.:4.550   trt1:10  
 Median :5.155   trt2:10  
 Mean   :5.073            
 3rd Qu.:5.530            
 Max.   :6.310
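
Each value in the summary() output can also be computed on its own. This is a small sketch with base R functions, shown only to make clear where the numbers come from.

```r
df <- PlantGrowth

mean(df$weight)    # 5.073, the Mean row
median(df$weight)  # 5.155, the Median row
range(df$weight)   # 3.59 6.31, the Min. and Max. rows

# 1st and 3rd quartiles
quantile(df$weight, probs = c(0.25, 0.75))

# Number of observations per group: 10 each
table(df$group)
```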

We can also look at the values of each column.

df$weight

 [1] 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14 4.81 4.17 4.41 3.59 5.87
[16] 3.83 6.03 4.89 4.32 4.69 6.31 5.12 5.54 5.50 5.37 5.29 4.92 6.15 5.80 5.26

df$group

 [1] ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl trt1 trt1 trt1 trt1 trt1
[16] trt1 trt1 trt1 trt1 trt1 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2
Levels: ctrl trt1 trt2

We can also show the first or the last rows. By default, head() and tail() return six rows.

head(df)

  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl

tail(df)

   weight group
25   5.37  trt2
26   5.29  trt2
27   4.92  trt2
28   6.15  trt2
29   5.80  trt2
30   5.26  trt2
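
Both functions accept an n argument to control how many rows are returned. A minimal sketch, assuming we only want three rows:

```r
df <- PlantGrowth

head(df, n = 3)  # rows 1 to 3
tail(df, n = 3)  # rows 28 to 30
```

Note that tail() keeps the original row names, so the output shows where the rows sit in the full dataset.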

Inspect the Data using the `{dplyr}` Package

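dplyr is one of the most popular packages in the tidyverse. Before we proceed, we need to load it. Here we use pacman::p_load(), which requires the pacman package and installs dplyr first if it is missing; library(dplyr) works as well.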

pacman::p_load(dplyr)

`dplyr::glimpse()`

The dplyr::glimpse() function produces output similar to that of str(): the number of rows and columns, followed by the type and first values of each column.

glimpse(PlantGrowth)

Rows: 30
Columns: 2
$ weight <dbl> 4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14, 4.8…
$ group  <fct> ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, trt…

Using `tibble`

A tibble is a modern take on the data.frame. It is built on top of data.frame, and in most cases it behaves like one. Even so, a tibble has some benefits over a plain data.frame:

It tells the data type of each column.
Instead of printing all the data, it only prints a limited number of rows and columns, so that the console is not flooded with too much text.
Missing values and negative values are printed in red (in consoles that support color).
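
To see these printing features on something smaller than PlantGrowth, here is a minimal sketch. The object demo and its columns value and label are hypothetical and made up purely for illustration; tibble() is available after loading dplyr.

```r
library(dplyr)  # re-exports tibble() and as_tibble()

# Hypothetical toy data, made up only to show the printing behavior
demo <- tibble(
  value = c(1.5, -2, NA),
  label = c("a", NA, "c")
)

# The header shows the column types <dbl> and <chr>.
# In a console that supports color, NA and -2 are highlighted.
demo
```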

The function dplyr::as_tibble() converts a data.frame into a tibble.

tbl <- as_tibble(df)
tbl

# A tibble: 30 × 2
   weight group
    <dbl> <fct>
 1   4.17 ctrl 
 2   5.58 ctrl 
 3   5.18 ctrl 
 4   6.11 ctrl 
 5   4.5  ctrl 
 6   4.61 ctrl 
 7   5.17 ctrl 
 8   4.53 ctrl 
 9   5.33 ctrl 
10   5.14 ctrl 
# ℹ 20 more rows

Print a `tibble`

By default, only the first 10 rows are printed when a tibble has more than 20 rows.

print(tbl) # Prints the first 10 rows

# A tibble: 30 × 2
   weight group
    <dbl> <fct>
 1   4.17 ctrl 
 2   5.58 ctrl 
 3   5.18 ctrl 
 4   6.11 ctrl 
 5   4.5  ctrl 
 6   4.61 ctrl 
 7   5.17 ctrl 
 8   4.53 ctrl 
 9   5.33 ctrl 
10   5.14 ctrl 
# ℹ 20 more rows

To print more rows, set the n argument:

print(tbl, n = 20) # Prints the first 20 rows

# A tibble: 30 × 2
   weight group
    <dbl> <fct>
 1   4.17 ctrl 
 2   5.58 ctrl 
 3   5.18 ctrl 
 4   6.11 ctrl 
 5   4.5  ctrl 
 6   4.61 ctrl 
 7   5.17 ctrl 
 8   4.53 ctrl 
 9   5.33 ctrl 
10   5.14 ctrl 
11   4.81 trt1 
12   4.17 trt1 
13   4.41 trt1 
14   3.59 trt1 
15   5.87 trt1 
16   3.83 trt1 
17   6.03 trt1 
18   4.89 trt1 
19   4.32 trt1 
20   4.69 trt1 
# ℹ 10 more rows

To print all rows, set n = Inf:

print(tbl, n = Inf) # Prints all rows

# A tibble: 30 × 2
   weight group
    <dbl> <fct>
 1   4.17 ctrl 
 2   5.58 ctrl 
 3   5.18 ctrl 
 4   6.11 ctrl 
 5   4.5  ctrl 
 6   4.61 ctrl 
 7   5.17 ctrl 
 8   4.53 ctrl 
 9   5.33 ctrl 
10   5.14 ctrl 
11   4.81 trt1 
12   4.17 trt1 
13   4.41 trt1 
14   3.59 trt1 
15   5.87 trt1 
16   3.83 trt1 
17   6.03 trt1 
18   4.89 trt1 
19   4.32 trt1 
20   4.69 trt1 
21   6.31 trt2 
22   5.12 trt2 
23   5.54 trt2 
24   5.5  trt2 
25   5.37 trt2 
26   5.29 trt2 
27   4.92 trt2 
28   6.15 trt2 
29   5.8  trt2 
30   5.26 trt2

Transform a `tibble` into a `data.frame`

A tibble can be converted back into a data.frame with as.data.frame().

tbl %>% as.data.frame()

   weight group
1    4.17  ctrl
2    5.58  ctrl
3    5.18  ctrl
4    6.11  ctrl
5    4.50  ctrl
6    4.61  ctrl
7    5.17  ctrl
8    4.53  ctrl
9    5.33  ctrl
10   5.14  ctrl
11   4.81  trt1
12   4.17  trt1
13   4.41  trt1
14   3.59  trt1
15   5.87  trt1
16   3.83  trt1
17   6.03  trt1
18   4.89  trt1
19   4.32  trt1
20   4.69  trt1
21   6.31  trt2
22   5.12  trt2
23   5.54  trt2
24   5.50  trt2
25   5.37  trt2
26   5.29  trt2
27   4.92  trt2
28   6.15  trt2
29   5.80  trt2
30   5.26  trt2
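
A quick way to confirm that the conversion worked is to check the class of the result and compare its columns with the original data. This is a small sketch; the name back is an assumption chosen for illustration.

```r
library(dplyr)

tbl  <- as_tibble(PlantGrowth)
back <- as.data.frame(tbl)

class(back)               # "data.frame"
inherits(back, "tbl_df")  # FALSE, the tibble classes are gone

# The column contents are unchanged by the round trip
identical(back$weight, PlantGrowth$weight)  # TRUE
identical(back$group, PlantGrowth$group)    # TRUE
```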

Compare `data.frame` and `tibble`

Here is a comparison of calling some common functions, first on a tibble and then on a data.frame.

class(tbl)

[1] "tbl_df"     "tbl"        "data.frame"

str(tbl)

tibble [30 × 2] (S3: tbl_df/tbl/data.frame)
 $ weight: num [1:30] 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

head(tbl)

# A tibble: 6 × 2
  weight group
   <dbl> <fct>
1   4.17 ctrl 
2   5.58 ctrl 
3   5.18 ctrl 
4   6.11 ctrl 
5   4.5  ctrl 
6   4.61 ctrl

summary(tbl)

     weight       group   
 Min.   :3.590   ctrl:10  
 1st Qu.:4.550   trt1:10  
 Median :5.155   trt2:10  
 Mean   :5.073            
 3rd Qu.:5.530            
 Max.   :6.310

tbl$weight

 [1] 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14 4.81 4.17 4.41 3.59 5.87
[16] 3.83 6.03 4.89 4.32 4.69 6.31 5.12 5.54 5.50 5.37 5.29 4.92 6.15 5.80 5.26

tbl$group

 [1] ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl trt1 trt1 trt1 trt1 trt1
[16] trt1 trt1 trt1 trt1 trt1 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2
Levels: ctrl trt1 trt2

class(df)

[1] "data.frame"

str(df)

'data.frame':   30 obs. of  2 variables:
 $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

head(df)

  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl

summary(df)

     weight       group   
 Min.   :3.590   ctrl:10  
 1st Qu.:4.550   trt1:10  
 Median :5.155   trt2:10  
 Mean   :5.073            
 3rd Qu.:5.530            
 Max.   :6.310

df$weight

 [1] 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14 4.81 4.17 4.41 3.59 5.87
[16] 3.83 6.03 4.89 4.32 4.69 6.31 5.12 5.54 5.50 5.37 5.29 4.92 6.15 5.80 5.26

df$group

 [1] ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl trt1 trt1 trt1 trt1 trt1
[16] trt1 trt1 trt1 trt1 trt1 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2
Levels: ctrl trt1 trt2
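
The functions above give the same results for both objects, apart from how they are printed. One place where the two do behave differently is single-column subsetting with square brackets. The following sketch illustrates this; it is not a complete list of differences.

```r
library(dplyr)

df  <- PlantGrowth
tbl <- as_tibble(df)

# Selecting one column with [ , ] simplifies a data.frame to a vector
class(df[, "weight"])   # "numeric"

# The same call on a tibble returns a one-column tibble
class(tbl[, "weight"])  # "tbl_df" "tbl" "data.frame"

# $ and [[ ]] return a plain vector for both
identical(df$weight, tbl$weight)          # TRUE
identical(df[["group"]], tbl[["group"]])  # TRUE
```

If code written for a data.frame relies on the simplified vector, using $ or [[ ]] keeps it working for both types.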

What’s next?

Reading numerical data is intuitive, but we usually prefer “seeing” the data. Navigate to the introduction to ggplot2 for more information about plotting the data.

You can also navigate to this article to learn how to work with your data and extract useful information from it.
