Swirl Cliffnotes – Getting and Cleaning Data in R – dplyr

This post contains R programming code from my study and practice. If you stumbled onto this page in a search, I hope you find my “Swirl Cliff-notes” helpful.

Manipulating CRAN Download Data with dplyr

Colorful Command List with Notes

> library(dplyr) ##Load the dplyr package and its five “verb” functions: select(), filter(), arrange(), mutate(), and summarize()
> packageVersion("dplyr")
##Check version

> path2csv <- "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/swirl/Courses/Getting_and_Cleaning_Data/Manipulating_Data_with_dplyr/2014-07-08.csv" ##RStudio’s CRAN download log from July 8, 2014, which contains information on roughly 225,000 R package downloads (http://cran-logs.rstudio.com/); adjust the path for your own machine
> mydf <- read.csv(path2csv, stringsAsFactors = FALSE)
##Read in csv file
> dim(mydf)
##View dimensions
> head(mydf)
##View first 6 rows
> cran <- tbl_df(mydf)
##Create a data frame table (tbl_df() is deprecated in newer dplyr releases; as_tibble() is the replacement)
> rm("mydf")
##Remove object "mydf"

> cran ##View brief contents of data frame table, truncated to 10 rows and as many columns as will fit on screen
> ?select
##Look up help information on select() function
> select(cran,ip_id,package,country)
##Subset with 3 variables(columns) and print
> select(cran,r_arch:country)
##Subset a sequence of variables
> select(cran, country:r_arch)
##Select a subset in reverse
> select(cran,-time)
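##Subset all variables except time
> select(cran,-(X:size))
##Subset all variables except the range X through size
> filter(cran,package=="swirl")
##Select a subset of rows
> filter(cran,r_version=="3.1.1",country=="US")
##Comma-separated conditions must all be true (AND)
> ?Comparison
##Look up info on ==,<=,>=,!=, etc.
> filter(cran,r_version<="3.0.2",country=="IN")
##r_version is a character column, so <= compares strings, not parsed version numbers
> filter(cran,country=="US"|country=="IN")
##Rows where either condition is true (OR)
> filter(cran,size>100500, r_os=="linux-gnu")
> filter(cran,!is.na(r_version))
##Subset of rows for which r_version is not missing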
> cran2 <- select(cran,size:ip_id)
> arrange(cran2,ip_id)
##Reorder rows so specific column is ascending
> arrange(cran2,desc(ip_id))
##Reorder rows by variable descending
> arrange(cran2,package,ip_id)
##Reorder rows by multiple variables
> arrange(cran2,country,desc(r_version),ip_id)
> cran3 <- select(cran,ip_id,package,size)
> cran3
> mutate(cran3,size_mb=size/2^20)
##Create a new variable based on existing variable
> mutate(cran3,size_mb=size/2^20,size_gb=size_mb/2^10)
##Create multiple new variables
> mutate(cran3,correct_size=size+1000)
> summarize(cran,avg_bytes=mean(size))
##Collapse dataset into single row

# A tibble: 1 × 1
avg_bytes
<dbl>
1 844086.5

> savehistory("~/Programming/class/swirl-dplyr-1.Rhistory") ##Save history of commands
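
To try the five verbs without the CRAN log file, here is a minimal sketch on a tiny made-up data frame. The object name toy and all of its values are hypothetical stand-ins for cran; only the column names are borrowed from the log.

```r
library(dplyr)

# Hypothetical stand-in for the cran table (all values are made up)
toy <- data.frame(
  package = c("swirl", "dplyr", "swirl", "plyr"),
  country = c("US", "IN", "US", "CA"),
  size    = c(2097152, 1048576, 524288, 3145728),
  ip_id   = c(1, 2, 2, 3),
  stringsAsFactors = FALSE
)

picked  <- select(toy, package, size)              # keep two columns
us_rows <- filter(toy, country == "US")            # keep rows matching a condition
sorted  <- arrange(toy, desc(size))                # largest download first
with_mb <- mutate(toy, size_mb = size / 2^20)      # add a column: bytes to megabytes
overall <- summarize(toy, avg_bytes = mean(size))  # collapse to a single row

print(with_mb)
print(overall)
```

Each verb returns a new data frame and leaves toy unchanged, which is why the results are assigned to separate names here.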

 

More on dplyr using CRAN’s download log of R packages

Let’s look at the download log (object “cran”, same as above) by grouping the rows together by R package names.

> by_package <- group_by(cran,package)

And here’s the result.

> by_package
Source: local data frame [225,468 x 11]
Groups: package [6,023]

       X       date     time    size r_version r_arch      r_os      package version country ip_id
   <int>      <chr>    <chr>   <int>     <chr>  <chr>     <chr>        <chr>   <chr>   <chr> <int>
1      1 2014-07-08 00:54:41   80589     3.1.0 x86_64   mingw32    htmltools   0.2.4      US     1
2      2 2014-07-08 00:59:53  321767     3.1.0 x86_64   mingw32      tseries 0.10-32      US     2
3      3 2014-07-08 00:47:13  748063     3.1.0 x86_64 linux-gnu        party  1.0-15      US     3
4      4 2014-07-08 00:48:05  606104     3.1.0 x86_64 linux-gnu        Hmisc  3.14-4      US     3
5      5 2014-07-08 00:46:50   79825     3.0.2 x86_64 linux-gnu       digest   0.6.4      CA     4
6      6 2014-07-08 00:48:04   77681     3.1.0 x86_64 linux-gnu randomForest   4.6-7      US     3
7      7 2014-07-08 00:48:35  393754     3.1.0 x86_64 linux-gnu         plyr   1.8.1      US     3
8      8 2014-07-08 00:47:30   28216     3.0.2 x86_64 linux-gnu      whisker   0.3-2      US     5
9      9 2014-07-08 00:54:58    5928      <NA>   <NA>      <NA>         Rcpp  0.10.4      CN     6
10    10 2014-07-08 00:15:35 2206029     3.0.2 x86_64 linux-gnu     hflights     0.1      US     7
# ... with 225,458 more rows

Now we apply a function to the grouped data…

> pack_sum <- summarize(
    by_package,                         ##data object to be summarized
    count = n(),                        ##column name and download count
    unique = n_distinct(ip_id),         ##unique IP addresses
    countries = n_distinct(country),    ##unique countries
    avg_bytes = mean(size)              ##average download size
  )

And here is the result.

> pack_sum
# A tibble: 6,023 × 5
       package count unique countries  avg_bytes
         <chr> <int>  <int>     <int>      <dbl>
1           A3    25     24        10   62194.96
2          abc    29     25        16 4826665.00
3     abcdeFBA    15     15         9  455979.87
4  ABCExtremes    18     17         9   22904.33
5     ABCoptim    16     15         9   17807.25
6        ABCp2    18     17        10   30473.33
7     abctools    19     19        11 2589394.00
8          abd    17     16        10  453631.24
9         abf2    13     13         9   35692.62
10       abind   396    365        50   32938.88
# ... with 6,013 more rows
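
The same group-then-summarize pattern can be checked by hand on a small example. This is a sketch on made-up data; toy and toy_sum are hypothetical names, and the values were chosen so the counts are easy to verify.

```r
library(dplyr)

# Hypothetical download records (all values are made up)
toy <- data.frame(
  package = c("swirl", "swirl", "swirl", "dplyr", "dplyr"),
  country = c("US", "US", "IN", "CA", "CA"),
  ip_id   = c(1, 1, 2, 3, 4),
  size    = c(100, 200, 300, 400, 600),
  stringsAsFactors = FALSE
)

toy_sum <- summarize(
  group_by(toy, package),
  count     = n(),                  # rows (downloads) per package
  unique    = n_distinct(ip_id),    # distinct ip_id values per package
  countries = n_distinct(country),  # distinct countries per package
  avg_bytes = mean(size)            # mean download size per package
)

print(toy_sum)
```

The result has one row per package. Here swirl has three downloads but only two distinct ip_id values, because one ip_id downloaded it twice.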

To find the packages in the top 1% by download count, first compute the 0.99 quantile of the count column:

> quantile(pack_sum$count,probs=0.99)
   99% 
679.56

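Then subset, keeping the packages whose count exceeds that threshold. Because count is a whole number, count > 679 selects the same rows as count > 679.56.

> top_counts <- filter(pack_sum,count>679)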
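
Instead of copying the threshold by hand, the quantile can be stored and reused in the filter. This is a sketch on made-up counts; toy_sum and cutoff are hypothetical names, not part of the swirl lesson.

```r
library(dplyr)

# Hypothetical per-package download counts (made up for illustration)
toy_sum <- data.frame(
  package = paste0("pkg", 1:200),
  count   = 1:200,
  stringsAsFactors = FALSE
)

# 0.99 quantile of the counts; unname() drops the "99%" label
cutoff <- unname(quantile(toy_sum$count, probs = 0.99))

# Keep only the packages above the threshold
top_counts <- filter(toy_sum, count > cutoff)

print(cutoff)
print(top_counts)
```

Computing the cutoff in code means the filter still selects the top 1% if the underlying data changes.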

Many of these function calls can be condensed into a single expression, but nested calls quickly become hard to read. The next two blocks of code produce the same result: the first nests the calls, and the second uses the chaining operator %>%, which passes the result of each step to the next function as its first argument.

> result2 <-
    arrange(
        filter(
            summarize(
                group_by(cran, package),
                count = n(),
                unique = n_distinct(ip_id),
                countries = n_distinct(country),
                avg_bytes = mean(size)
            ),
            countries > 60
        ),
        desc(countries),
        avg_bytes
    )
> print(result2)
> result3 <-
    cran %>%
    group_by(package) %>%
    summarize(count = n(),unique = n_distinct(ip_id),
        countries = n_distinct(country), avg_bytes = mean(size) ) %>%
    filter(countries > 60) %>%
    arrange(desc(countries), avg_bytes)
> print(result3)
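
To confirm that the two styles really are equivalent, a shortened version of each can be run on a small data set and compared. This is a sketch on made-up data; toy, nested, and piped are hypothetical names, and the filter threshold is lowered to fit the toy data.

```r
library(dplyr)

# Hypothetical download records (all values are made up)
toy <- data.frame(
  package = c("a", "a", "b", "b", "b", "c"),
  country = c("US", "IN", "US", "US", "CA", "US"),
  size    = c(10, 30, 5, 5, 20, 50),
  stringsAsFactors = FALSE
)

# Nested form: read from the inside out
nested <- arrange(
  filter(
    summarize(
      group_by(toy, package),
      countries = n_distinct(country),
      avg_bytes = mean(size)
    ),
    countries > 1
  ),
  desc(countries),
  avg_bytes
)

# Chained form: read from top to bottom
piped <- toy %>%
  group_by(package) %>%
  summarize(countries = n_distinct(country), avg_bytes = mean(size)) %>%
  filter(countries > 1) %>%
  arrange(desc(countries), avg_bytes)

same <- isTRUE(all.equal(nested, piped))
print(same)
```

The chained form lists the steps in the order they run, which is the main readability argument for %>%.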

Here’s one final example of chaining, written as it would appear in a script (no command-line > prompt):

 cran %>%
 select(ip_id, country, package, size) %>%
 mutate(size_mb = size / 2^20) %>%
 filter(size_mb <= 0.5) %>%
 arrange(desc(size_mb)) %>%
 print

The print function did not need parentheses because the manipulated “cran” object was passed to it with no other arguments.
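
A small sketch of that rule, using a made-up data frame named toy (hypothetical): with %>%, a bare function name and an empty pair of parentheses make the same call, and parentheses become necessary as soon as extra arguments are supplied.

```r
library(dplyr)

# Hypothetical three-row data frame
toy <- data.frame(x = 1:3)

# Bare function name: evaluated as print(toy)
res_bare <- toy %>% print

# Explicit parentheses: the same call
res_parens <- toy %>% print()

# Extra arguments require parentheses: evaluated as head(toy, n = 2)
first_two <- toy %>% head(n = 2)
print(first_two)
```

This shorthand is specific to the %>% operator. The native pipe |> added in R 4.1 requires a function call with parentheses on its right-hand side.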

Explore my Swirl Notes on GitHub

The rest of the Swirl lessons for Getting and Cleaning Data can be found on my GitHub account here: https://github.com/emiliehwolf/swirl_practice/tree/master/Getting_and_Cleaning_Data

 

Wolfie
