
Sunday, October 30, 2022

Final report for the course "Базы данных и SQL в обработке и анализе данных" on Coursera



By the end of my period of unemployment I had taken several courses on Coursera. One of them was "Foundations of Data Science: K-Means Clustering in Python", or in Russian "Базы данных и SQL в обработке и анализе данных".

Overall I liked the course. At first you think they will just run quickly through K-means and that's it. In fact, though, it is a good opportunity for a beginner to practice Python. A bit of everything: Pandas, Numpy, Sklearn, Matplotlib. At the very least, you are expected to look up additional information about Python on your own.

At the end of the course you have to write a report. I am publishing mine here, in English. I would like to draw attention to the grading criteria:
  1. the purpose of the Data Science project;
  2. description of the data;
  3. methods: how the data were analysed;
  4. summary of the results;
  5. recommendations for your client.
  6. The report is well written and includes all the required information.
  7. The report is written for a Client, not a specialist in Data Science.
  8. The length of the report is 2-4 pages.

Fellow course participants assign the scores based exactly on these criteria (that is how the peer review is set up). One reviewer lowered my score on points 5 and 7.

So, here is the report:

Banknote authentication model

In our project we are trying to create a tool (a model) that determines whether a banknote is genuine or forged. The banknote authentication dataset, freely accessible on openml.org, is used. Link to the dataset: https://www.openml.org/search?type=data&status=active&id=1462&sort=runs

The dataset is about distinguishing genuine and forged banknotes. Data were extracted from images taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400x400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from these images.

The original dataset contains 4 feature arrays:
  • V1. variance of Wavelet Transformed image (continuous)
  • V2. skewness of Wavelet Transformed image (continuous)
  • V3. kurtosis of Wavelet Transformed image (continuous)
  • V4. entropy of image (continuous)
Class (target): presumably 1 for genuine and 2 for forged.
Description of the dataset:
  • The dataset used for clustering (features V1 and V2) contains 2 columns with 1372 rows.
  • There are no missing values (neither nulls nor NAs).
  • The scales of the data in the two columns are close to each other.
  • Mean values (0.433735 for V1 and 1.922353 for V2) are close to zero compared with the min (-7.042100 for V1 and -13.773100 for V2) and max (6.824800 for V1 and 12.951600 for V2) values. Standard deviations are 2.842763 for V1 and 5.869047 for V2 (the short code sketch after this list shows how such figures can be obtained).
  • Diagrams with V1 and V2 on the Y axis and the row index on the X axis show that there are two clusters in each column; the border is somewhere around index 750.
  • There are no hard outliers.
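A minimal sketch of how this summary can be produced with pandas. The file name banknotes.csv and the way the data was exported from openml.org are my assumptions, not part of the original report:

    import pandas as pd

    # Assumption: the OpenML "banknote-authentication" data has been saved
    # locally as banknotes.csv with columns V1..V4 and Class.
    df = pd.read_csv("banknotes.csv")
    df["Class"] = df["Class"].astype(int)   # make sure the target is numeric (1/2)
    data = df[["V1", "V2"]]                 # only the first two features are used

    print(data.isna().sum())                # check for missing values
    print(data.describe())                  # count, mean, std, min, max, quartiles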

Diagrams with the point distributions are shown below:

[Figure: point index vs. V1]

[Figure: point index vs. V2]

[Figure: V1 vs. V2. Violet points correspond to genuine banknotes and yellow points to forged ones.]
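A sketch of how such diagrams can be produced with Matplotlib, continuing from the data frame loaded above (figure and marker sizes are arbitrary choices of mine):

    import matplotlib.pyplot as plt

    # Row index vs. V1 and V2: the two clusters and the border near index 750.
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(df.index, df["V1"], s=5)
    axes[0].set(xlabel="point index", ylabel="V1")
    axes[1].scatter(df.index, df["V2"], s=5)
    axes[1].set(xlabel="point index", ylabel="V2")

    # V1 vs. V2 colored by the true class (1 = genuine, 2 = forged).
    plt.figure(figsize=(5, 4))
    plt.scatter(df["V1"], df["V2"], c=df["Class"], s=5)
    plt.xlabel("V1")
    plt.ylabel("V2")
    plt.show()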

 

To determine whether a banknote is genuine or forged, the K-Means clustering method has been used. The advantages of this method are speed and unsupervised learning (no teacher, i.e. no labelled training data, is needed). It splits the data points into a requested number of clusters; in our case 2 clusters are needed.
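A minimal sketch of such a clustering run with scikit-learn, continuing from the data frame above (the parameter values are only an example, not the exact settings used in the notebook):

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Scale the two features so they contribute equally (see the notes below).
    X = StandardScaler().fit_transform(df[["V1", "V2"]])

    # Two clusters: genuine vs. forged. n_init repeats the random
    # initialization several times and keeps the best result.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)   # cluster labels are 0 and 1

    print(kmeans.cluster_centers_)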


The K-Means algorithm was run several times to verify whether it is stable for our dataset. As you can see in the diagrams, the method gives very close results between runs.

[Figure: results of 9 learning runs]

The initial dataset has 762 positive (genuine) and 610 negative (forged) samples. On average the model recognizes 598 banknotes as genuine and 774 as forged, with an average accuracy (calculated as [the number of points where the real and predicted classes match] / [the total number of points]) of about 87.8%. It means that in the sample dataset about 1204 banknotes are authenticated correctly and about 167 banknotes are authenticated incorrectly. In most cases these are false negative predictions: a genuine banknote is recognized as a forged one. This is better than false positive predictions: a suspect banknote can be verified manually or with another automatic method, but we can be pretty sure that a really forged banknote does not pass the exam and will be filtered out.
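The accuracy figure can be computed, for example, like this (a sketch continuing from the run above; the naive 0/1 to 1/2 mapping may need to be inverted, as discussed in note 3 below):

    import numpy as np

    predicted = labels + 1                     # map cluster labels 0/1 to classes 1/2
    correct = np.sum(predicted == df["Class"].values)
    accuracy = correct / len(df)               # matched points / total points
    print(f"accuracy = {accuracy:.3f}")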

Points corresponding to genuine and forged banknotes intersect in the central area. I did not manage to separate the points in this area using only two features (V1 and V2) and the K-Means method, so if a banknote's features fall in this area, additional verification is required.

Important notes:

  1. It is very important to study the input data. The K-Means model outputs zeros and ones when classifying the points, but the initial dataset uses ones and twos, so the data should be transformed to a comparable form (see the sketch after these notes).

  2. Do not ignore data scaling. Before scaling the accuracy was about 65%; after scaling it becomes 87.8%.

     [Figure: result of clustering without scaling]

  3. Learn the algorithms you use. As the centroids are initially chosen randomly, they can swap places between runs: the upper centroid can be first in the list in one run and the lower one in another. In that case the accuracy flips between roughly 65% and 35% at random (before scaling was applied). The binary result should be inverted when the order of the centroids does not match the order of the classes in the input data (ones, then twos).
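One simple way to handle the mapping described in notes 1 and 3 is to try both label assignments and keep the one that agrees with the data better. This is only a sketch, not the notebook's actual code; the helper name align_labels is mine:

    import numpy as np

    def align_labels(cluster_labels, true_classes):
        """Map K-Means labels (0/1) to dataset classes (1/2), inverting
        them if the opposite mapping matches the true classes better."""
        direct = cluster_labels + 1            # 0 -> 1, 1 -> 2
        inverted = 2 - cluster_labels          # 0 -> 2, 1 -> 1
        if np.mean(direct == true_classes) >= np.mean(inverted == true_classes):
            return direct
        return inverted

    predicted = align_labels(labels, df["Class"].values)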

 

Here I have put the Jupyter notebook and the original report in LibreOffice format.