StatQuest: Principal Component Analysis (PCA), Step-by-Step


Principal Component Analysis (PCA) is one of the most useful data analysis and machine learning methods out there. It can identify patterns in highly complex datasets, and it can tell you which variables in your data are the most important. Lastly, it can tell you how accurate your new understanding of the data actually is.

In this video, I go one step at a time through PCA, and the method used to solve it, Singular Value Decomposition. I take it nice and slowly so that the simplicity of the method is revealed and clearly explained.
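
To make the steps concrete, here is a minimal sketch of finding PC1 and PC2 for 2-Dimensional data in plain Python. The sample values and all variable names are made up for illustration and are not taken from the video. Instead of calling an SVD routine, it uses the closed-form angle that applies only in the 2-D case: PC1 is the line through the origin that maximizes the sum of squared distances from the projected points to the origin.

```python
import math

# Hypothetical measurements for 6 samples: (gene 1, gene 2).
samples = [(10.0, 6.0), (11.0, 4.0), (8.0, 5.0),
           (3.0, 3.0), (2.0, 2.8), (1.0, 1.0)]
n = len(samples)

# Step 1: shift the data so its center sits on the origin.
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
centered = [(x - mean_x, y - mean_y) for x, y in samples]

# Step 2: sums of squares and cross-products of the centered data.
sxx = sum(x * x for x, _ in centered)
syy = sum(y * y for _, y in centered)
sxy = sum(x * y for x, y in centered)

# Step 3: angle of the line through the origin that maximizes the sum of
# squared distances from the projected points to the origin (PC1).
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))  # unit vector: loading scores for PC1
pc2 = (-pc1[1], pc1[0])                   # PC2 is perpendicular to PC1


def ss_distances(direction):
    """Sum of squared distances from the projected points to the origin."""
    return sum((x * direction[0] + y * direction[1]) ** 2 for x, y in centered)


ss_pc1 = ss_distances(pc1)
ss_pc2 = ss_distances(pc2)
singular_value_pc1 = math.sqrt(ss_pc1)  # square root of SS(distances for PC1)
```

With more than two variables there is no single angle to solve for, which is where a general-purpose SVD routine takes over.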

There is a minor error at 1:47: Points 5 and 6 are not in the right location.

If you are interested in doing PCA in R see:

If you are interested in learning more about how to determine the number of principal components, see:

For a complete index of all the StatQuest videos, check out:

If you’d like to support StatQuest, please consider…
YouTube Membership:

…a cool StatQuest t-shirt or sweatshirt (USA/Europe):

…buying one or two of my songs (or go large and get a whole album!)

…or just donating to StatQuest!

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on Twitter:

0:00 Awesome song and introduction
0:30 Conceptual motivation for PCA
3:23 PCA worked out for 2-Dimensional data
5:03 Finding PC1
12:08 Singular vector/value, Eigenvector/value and loading scores defined
12:56 Finding PC2
14:14 Drawing the PCA graph
15:03 Calculating percent variation for each PC and scree plot
16:30 PCA worked out for 3-Dimensional data
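
As a rough sketch of the percent-variation step, the snippet below turns the variation for each PC into percentages and prints a text-only scree plot. The variation values are hypothetical placeholders; here each one is assumed to be SS(distances) / (n - 1) for its PC.

```python
# Hypothetical variation for each PC, assumed to be SS(distances) / (n - 1).
variation = {"PC1": 15.0, "PC2": 3.0}

# Each PC's share of the total variation around all of the PCs.
total = sum(variation.values())
percent = {pc: 100.0 * v / total for pc, v in variation.items()}

# Text version of a scree plot: one bar per PC, one '#' per 5 percent.
for pc, pct in percent.items():
    print(f"{pc}: {pct:5.1f}% " + "#" * round(pct / 5))
```

The same calculation extends to three or more PCs by adding entries to `variation`.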

#statquest #PCA #ML




  1. First, great video and this helped tremendously in my understanding of how PCA works. My question is about how one begins to calculate PC levels 4 and higher since there is not a good visual analogy to these higher dimensional components. Does this require some sort of matrix algebra? Thanks again for the great video!

  2. If the number of PCs is reduced from 3 to 2 because of their variation, then our plot changes from 3D to 2D. From this, can we conclude that we are removing gene 3 from the representation of our data?

  3. Hi! Can I just say: please make more videos on topics that you understand, because your way of teaching is super rare. Really easy to understand for an otherwise elusive concept. (Yeah I know, I'm not that bright.)

  4. Thanks for the simplified explanation 😊 I have a question: how are the covariance matrix and eigenvectors related in PCA?

  5. Just wanted to say your vids helped me a bunch in my intro to ML class. Despite being an intro class they kinda just throw equations up there and call it a day. These explanations are very intuitive. Thanks.

  6. Thank you for this wonderful video – turning a very abstract concept into something that we can interpret with (biological) meaning!

  7. Josh, please help me understand. I have 4 variables (columns: X1, X2, X3, X4) and 8 entries (rows). Then I find PC1, which has an eigenvector. How is this PC1 column filled? What are the 8 values of the PC1 column? How do we get those from the eigenvector or eigenvalue? Please help. Eagerly waiting.

  8. Hi there, can someone help me understand what he means by "1,2,3 are more similar to each other than 4,5,6" at 1:40? Thank you and stay safe!

  9. I am a non-CS student, and after watching your video I bet I will be ahead of most CS students who have ML as a course. Thank you…

  10. Thank you very much! I so appreciate your work. I have a question for you, sir.
    q1) When calculating PC2, must the line pass through the origin? Or should we calculate the mean of PC1?
    q2) How do we calculate the positions of points in the new PCA plot using only the distance between the origin and the projected points?

  11. Josh, if you added a little animated character for your intro song I think we could make a video around the entire song, and you too could become a one-hit wonder!

  12. I like how well you explain this. Thank you.

    One question: in the 3-dimensional example, how do you know what orientation to rotate PC1 to begin with?

    In 2 dimensions it's pretty obvious, but in 3 it seems really complicated. Is it restricted to rotating between two axes?

  13. I have watched many of your videos and liked them a lot.
    I have one query. With 4 variables, for example, we obtain 4 PCs and find that 3 are sufficient. In that case, the dimension reduces from 4 to 3. Now, my question is: does it mean removing one variable among the 4? Does it mean we are left with the same variable set but only 3 variables? Or does it mean that we end up with 3 variables but with a totally new data set?

  14. Awesome work as always Josh. Just one question. In regression, we try to minimize the sum of the squared vertical distances between the line and the actual point. But here we try to minimize the sum of the squared projected distances from the line to the actual point. Why do we try to minimize different distances in regression vs. pca?

