What is ggplot2?
- It is a package in R.
- An implementation of the grammar of graphics.
- It is a ‘third’ graphics system for R (along with base and lattice), built on top of the grid system.
- Available from CRAN via install.packages().
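As a minimal sketch of the installation step, the snippet below installs the package only when it is missing and then loads it. The requireNamespace() guard is my own addition so the script can be re-run safely; it is not something ggplot2 requires.

```r
# Install ggplot2 from CRAN only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```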
Grammar of graphics: The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn in a specific coordinate system.
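To make the grammar concrete, here is a rough sketch that spells out each piece with the core ggplot() function, using the mpg data frame bundled with ggplot2. The choice of a linear smoother and the name grammar_plot are only illustrative assumptions.

```r
library(ggplot2)

grammar_plot <- ggplot(
  data = mpg,                                       # the data
  mapping = aes(x = displ, y = hwy, color = drv)    # data -> aesthetic attributes
) +
  geom_point() +                                    # geometric object: points
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                         # statistical transformation
  coord_cartesian()                                 # coordinate system

print(grammar_plot)
```

Each `+` adds one component of the grammar, so you can read the call from top to bottom as data, aesthetics, geometry, statistics and coordinates.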
Plotting systems in R: ggplot2
- Splits the difference between base and lattice.
- Automatically deals with spacing, text and titles, but also lets you annotate a plot by adding to it.
- Superficially similar to lattice, but generally easier and more intuitive to use.
- The default mode makes many choices for you (which you can customise).
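Here is a small sketch of what "annotating by adding" can look like. The title, the axis labels and the position of the text note are values I made up for illustration; the defaults are used for everything else.

```r
library(ggplot2)

# Start from a plot that relies entirely on the defaults.
base_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Customise it by adding labels and a text annotation (illustrative values).
annotated_plot <- base_plot +
  labs(title = "Highway mileage by engine size",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon") +
  annotate("text", x = 6, y = 40, label = "illustrative note")

print(annotated_plot)
```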
The basics of ggplot2: qplot()
- Works much like plot() in the base graphics system.
- Looks for data in a data frame, similar to lattice, or in the parent environment.
- Plots are made up of aesthetics (size, color, shape) and geoms (points, lines).
- Factors are important for indicating subsets of the data (if they are to have different properties); they should be labeled.
- qplot() hides what goes on underneath, which is okay for most operations.
- ggplot() is the core function and is flexible enough to do things that qplot() cannot.
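As one example of that extra flexibility, the sketch below gives each layer its own aesthetics: the points are colored by drv while a single black smoother is fitted to all cars. As far as I know this is awkward to express with qplot(), because its aesthetics apply to every geom. The loess smoother and the name layered_plot are my own choices.

```r
library(ggplot2)

layered_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +               # color applies to the points only
  geom_smooth(method = "loess", formula = y ~ x,
              se = FALSE, color = "black")     # one smoother across all drive types

print(layered_plot)
```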
Let’s have a look at a built-in example from ggplot2.
Here is the structure of the built-in mpg data. In this data we will look at the relationship between three variables: “hwy”, “displ” and “drv”. The “drv” variable has three levels: “f”, “r” and “4”. The other variables are characteristics and specifications of the cars. Let’s draw a basic qplot() of the “mpg” data, plotting “displ” against “hwy”.
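The example uses the mpg data frame that ships with ggplot2. If you want to inspect it yourself, something like the following should work; the name mpg_subset is just a helper I introduce here.

```r
library(ggplot2)

# Overall structure: one row per car model, one column per characteristic.
str(mpg)

# Keep only the three variables used in this post and peek at the first rows.
mpg_subset <- mpg[, c("displ", "hwy", "drv")]
head(mpg_subset)

# The distinct drive types present in the data.
unique(mpg$drv)
```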
qplot(displ, hwy, data=mpg, color=mpg$displ) # displ is the x coordinate, hwy is the y coordinate and mpg is the data frame; here the points are colored by the displ variable.
Here is the plot: In the above graph we mapped color to mpg$displ, which is not a meaningful choice because displ is already shown on the x-axis. The real purpose of this data is to find the highway miles per gallon for each drive type at a given displ value.
qplot(displ, hwy, data=mpg, color=drv) # color=drv gives each level of drv its own color.
qplot(displ, hwy, data=mpg, color=drv, geom=c("point","smooth"))
Here is the plot: We have now seen the graph both with and without the “geom” parameter, so we can decide which one is easier to read. Both graphs show the same data, but the one with the geom parameter adds a smoothed trend line, which makes the pattern clearer.
We can also make a histogram with qplot() by specifying only a single variable. Here we use “hwy”, which shows the highway mileage for all the cars in the dataset. With the “fill” argument we can distinguish which cars are four-wheel, front-wheel or rear-wheel drive: the histogram is filled with a different color for each drive type.
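A call along these lines should produce the histogram described above. The binwidth of 2 is an assumed value, and the name hwy_hist is only for illustration. Note that recent ggplot2 releases mark qplot() as deprecated, so you may see a warning; the commented line shows what I believe is the equivalent ggplot() call.

```r
library(ggplot2)

# Single variable -> histogram; fill = drv colors the bars by drive type.
hwy_hist <- qplot(hwy, data = mpg, fill = drv, binwidth = 2)  # binwidth is an assumption
print(hwy_hist)

# Equivalent with the core function:
# ggplot(mpg, aes(hwy, fill = drv)) + geom_histogram(binwidth = 2)
```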
Have a look below:
Another feature of ggplot2 is called facets. Facets are like panels in lattice. They let us create separate plots for subsets of the data indicated by a factor variable, and arrange them as panels so the subsets can be viewed together. Another option for showing subsets is to color-code them, as we did before.
qplot(displ, hwy,data=mpg, facets=.~drv)
Here is the plot:
But if we have lots of data points, a single color-coded plot can be tricky to read: the colors can overlap and the separate groups become difficult to see. An easier way is to split the data into three groups and draw each group in its own panel.
qplot(displ, hwy, data=mpg, facets=drv~., color=drv) # binwidth only applies to histograms, so it is dropped for this scatter plot.
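For anyone using the core ggplot() function, I believe the two facet formulas used in this post correspond to facet_grid() as sketched below. The object names p_drv_dot and p_dot_drv are hypothetical; printing both and comparing them is a quick way to see how the formula controls the layout.

```r
library(ggplot2)

scatter <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point()

p_drv_dot <- scatter + facet_grid(drv ~ .)   # same formula as facets = drv~.
p_dot_drv <- scatter + facet_grid(. ~ drv)   # same formula as facets = .~drv

print(p_drv_dot)
print(p_dot_drv)
```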
- facets = .~drv
- facets = drv~.
Can you tell the difference between the two points above? Don’t worry, I will tell you; if you already know, please skip ahead.
facets = .~drv means the graph has a single row and one or more columns; the number of columns depends on the number of levels of the factor you pass to the function.
facets = drv~. means the graph has a single column and one or more rows; the number of rows depends on the number of levels of the factor you pass to the function.
I hope you now have the basic idea of ggplot2. Please keep in touch; in a few days I will write about advanced ggplot2. If you have any queries, please mention them in the comments section or shoot me an email at email@example.com. This article was originally posted here.