{"id":3432,"date":"2024-09-01T14:00:38","date_gmt":"2024-09-01T14:00:38","guid":{"rendered":"https:\/\/workhouse.sweetdishy.com\/?p=3432"},"modified":"2024-09-01T14:00:38","modified_gmt":"2024-09-01T14:00:38","slug":"k-means-clustering-unsupervised-clustering","status":"publish","type":"post","link":"https:\/\/workhouse.sweetdishy.com\/index.php\/2024\/09\/01\/k-means-clustering-unsupervised-clustering\/","title":{"rendered":"K-Means\u00a0Clustering\u00a0(Unsupervised\/Clustering)"},"content":{"rendered":"\n<p id=\"Par197\">The k-Means clustering algorithm, which is effective for large datasets, puts similar, unlabeled data into different groups. The first step is to select k, which is the number of clusters. To help with this, you can perform visualizations of that data to see if there are noticeable grouping areas.<\/p>\n\n\n\n<p>Here\u2019s a look at sample data, in Figure\u00a03-6:<\/p>\n\n\n\n<figure class=\"wp-block-image\" id=\"Fig6\"><img decoding=\"async\" src=\"https:\/\/learning.oreilly.com\/api\/v2\/epubs\/urn:orm:book:9781484250280\/files\/images\/480660_1_En_3_Chapter\/480660_1_En_3_Fig6_HTML.jpg\" alt=\"..\/images\/480660_1_En_3_Chapter\/480660_1_En_3_Fig6_HTML.jpg\"\/><figcaption class=\"wp-element-caption\"><strong><em>Figure 3-6.<\/em><\/strong>The initial plot for a dataset<\/figcaption><\/figure>\n\n\n\n<p>For this example, we assume there will be\u00a0two\u00a0clusters, and this means there will also be two\u00a0centroids. A centroid is the midpoint of a cluster. We will assign each randomly, which you can see in Figure\u00a03-7.<\/p>\n\n\n\n<figure class=\"wp-block-image\" id=\"Fig7\"><img decoding=\"async\" src=\"https:\/\/learning.oreilly.com\/api\/v2\/epubs\/urn:orm:book:9781484250280\/files\/images\/480660_1_En_3_Chapter\/480660_1_En_3_Fig7_HTML.jpg\" alt=\"..\/images\/480660_1_En_3_Chapter\/480660_1_En_3_Fig7_HTML.jpg\"\/><figcaption class=\"wp-element-caption\"><strong><em>Figure 3-7.<\/em><\/strong>This chart shows two centroids\u2014represented by circles\u2014that are randomly placed<\/figcaption><\/figure>\n\n\n\n<p>As you can see, the centroid at the top left looks way off, but the one on the right side is better. The k-Means algorithm will then calculate the average distances of the centroids and then change their\u00a0locations\u00a0. This will be iterated until the errors are fairly minimal\u2014a point that is called convergence, which you can see with Figure\u00a03-8.<\/p>\n\n\n\n<figure class=\"wp-block-image\" id=\"Fig8\"><img decoding=\"async\" src=\"https:\/\/learning.oreilly.com\/api\/v2\/epubs\/urn:orm:book:9781484250280\/files\/images\/480660_1_En_3_Chapter\/480660_1_En_3_Fig8_HTML.jpg\" alt=\"..\/images\/480660_1_En_3_Chapter\/480660_1_En_3_Fig8_HTML.jpg\"\/><figcaption class=\"wp-element-caption\"><strong><em>Figure 3-8.<\/em><\/strong>Through iterations, the k-Means algorithm gets better at grouping the data<\/figcaption><\/figure>\n\n\n\n<p id=\"Par201\">Granted, this is a simple illustration. But of course, with a complex dataset, it will be difficult to come up with the number of initial clusters. In this situation, you can&nbsp;experiment&nbsp;with different k values and then measure the average distances. By doing this multiple times, there should be more accuracy.<\/p>\n\n\n\n<p>Then why not just have a high number for k? You can certainly do this. But when you compute the average, you\u2019ll notice that there will be only incremental improvements. So one method is to stop at the point where this starts to occur. 
However, k-Means has its drawbacks. For instance, it does not work well with nonspherical data, which is the case with Figure 3-10.

[Figure 3-10. Here's a demonstration where k-Means does not work with nonspherical data]

With data like this, the k-Means algorithm would likely not pick up on the surrounding points, even though they form a pattern. But there are some algorithms that can help, such as DBSCAN (density-based spatial clustering of applications with noise), which groups points by density rather than by distance to a centroid and so can handle clusters of widely varying shapes and sizes. DBSCAN can, however, require lots of computational power. A short sketch follows.
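Here is a brief sketch contrasting the two algorithms on nonspherical data. The interleaved half-moons dataset and the eps and min_samples values are illustrative assumptions, chosen to mimic the shape problem in Figure 3-10; real data would need its own tuning.

```python
# A minimal sketch (not from the book) of DBSCAN on nonspherical data,
# using scikit-learn. make_moons produces two interleaved half-circles,
# similar in spirit to Figure 3-10, where k-Means struggles.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# k-Means assumes roughly spherical clusters, so it tends to slice the
# moons apart incorrectly.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN groups points by density: eps is the neighborhood radius and
# min_samples is the minimum number of neighbors for a core point.
# These values are illustrative guesses.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("k-Means labels found:", sorted(set(km_labels)))
print("DBSCAN labels found:", sorted(set(db_labels)))  # -1 marks noise
```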
Next, there is the situation where some clusters have lots of data and others have little. What might happen? There is a chance that the k-Means algorithm will not pick up on the sparse one. This is the case with Figure 3-11.

[Figure 3-11. If there are areas of thin data, the k-Means algorithm may not pick them up]