Day83

2021. 2. 16. 21:42 · Curriculum/KOSMO

Keywords: decision tree / random forest / digit classification

****

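1. Decision Tree

Decision Tree

   (1) Advantages
        - The trained model is easy to visualize, which makes it easy to understand
        - It is unaffected by the scale of the data, so preprocessing such as feature normalization or standardization is not needed

   (2) Disadvantages
        - Even with pre-pruning it tends to overfit, so its generalization performance is often poor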
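
The scale-invariance claim can be checked with a minimal sketch. It assumes scikit-learn's bundled Iris data and the same 7:3 split used later in this post; the variable names are illustrative and not part of the original notebook. One tree is trained on the raw features and one on standardized features, and their predictions are compared.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Tree trained on the raw measurements
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
pred_raw = raw_tree.predict(X_test)

# Tree trained on standardized features (zero mean, unit variance)
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
scaled_tree.fit(scaler.transform(X_train), y_train)
pred_scaled = scaled_tree.predict(scaler.transform(X_test))

# Standardization only shifts and stretches each feature, so the splits
# partition the samples in the same way
same_predictions = bool(np.array_equal(pred_raw, pred_scaled))
print('same predictions with and without scaling:', same_predictions)
```

A tree only compares one feature against a threshold at each node, so a monotonic rescaling of a feature moves the threshold but not the resulting partition.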

[Example] Iris species

The species is predicted from the measured width and length of the petals and sepals.
The 150 samples are classified into 3 species (setosa, versicolor, virginica).

In [1]:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

In [6]:

# 1. Load the data
iris = datasets.load_iris()

# Check the keys of the dataset
print(iris.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [11]:

# 2. Separate the data and the labels
X = iris.data
y = iris.target

# Preview the data
print(X[:3])
print(y[:3])

# 3. Split the dataset (training : test = 7 : 3)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
[0 0 0]
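
As a quick check of the 7:3 split, the sketch below prints the resulting shapes. The `stratify=y` argument is an addition for illustration that the cell above does not use; it keeps the three species equally represented in the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y is an assumption added here; the notebook cell splits without it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)   # 105 training rows and 45 test rows
test_counts = np.bincount(y_test)    # number of test samples per species
print(test_counts)
```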

In [12]:

# 5. Create and train the tree model

from sklearn.tree import DecisionTreeClassifier

iris_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
iris_tree.fit(X_train, y_train)

Out[12]:

DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)

In [13]:

# 6. Accuracy

print('Training accuracy : {:.3f}'.format(iris_tree.score(X_train, y_train)))
print('Test accuracy : {:.3f}'.format(iris_tree.score(X_test, y_test)))

Training accuracy : 0.981
Test accuracy : 0.978

In [19]:

# 7. Prediction
from sklearn.metrics import accuracy_score

new_data = np.array([[5.2, 3.8, 1.5, 0.3]])
print(iris_tree.predict(new_data))

y_pred = iris_tree.predict(X_test)
print('Accuracy : {:.3f}'.format(accuracy_score(y_test, y_pred)))

[0]
Accuracy : 0.978

Visualizing the decision tree

[ Note ] Installing graphviz

- Python library that interfaces with the Graphviz program
    > pip install graphviz


- The Graphviz program itself must be installed separately
  :  https://graphviz.gitlab.io/_pages/Download/Download_windows.html

  (1) Download and install manually

       > Windows > Stable 2.38 Windows install packages > 10 > release >  download graphviz-2.38.msi and run it

  (2) Install with Chocolatey

      [Note] Installing Chocolatey on Windows 10
            1- First, run Windows PowerShell (Administrator).
            2- Enter the following command on a single line

Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

            3- Verify the installation
                > choco -v

            4- Install Graphviz with choco
                > choco install graphviz

            5- Verify that Graphviz was installed
                C:\Program Files\Graphviz

Set the environment variable (from source code) : C:/Program Files/Graphviz/bin/

    import os

    os.environ['PATH'] += os.pathsep + 'C:/Program Files/Graphviz/bin/'

In [25]:

!pip install graphviz

Collecting graphviz
  Downloading graphviz-0.16-py2.py3-none-any.whl (19 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.16

In [26]:

# Install pydotplus
! pip install pydotplus

Collecting pydotplus
  Downloading pydotplus-2.0.2.tar.gz (278 kB)
Requirement already satisfied: pyparsing>=2.0.1 in c:\users\kosmo_03\anaconda3\lib\site-packages (from pydotplus) (2.4.7)
Building wheels for collected packages: pydotplus
  Building wheel for pydotplus (setup.py): started
  Building wheel for pydotplus (setup.py): finished with status 'done'
  Created wheel for pydotplus: filename=pydotplus-2.0.2-py3-none-any.whl size=24572 sha256=d8ba0bdfef983547a17dab91ed52cb2780e35f016fdb55b79262034c1bdae8b4
  Stored in directory: c:\users\kosmo_03\appdata\local\pip\cache\wheels\fe\cd\78\a7e873cc049759194f8271f780640cf96b35e5a48bef0e2f36
Successfully built pydotplus
Installing collected packages: pydotplus
Successfully installed pydotplus-2.0.2

In [27]:

from sklearn.tree import export_graphviz
import pydotplus
import graphviz
from IPython.display import Image

In [28]:

# Add the Graphviz path to the PATH environment variable
import os
os.environ['PATH'] += os.pathsep + 'C:/Program Files/Graphviz/bin'

dot_data = export_graphviz(iris_tree, out_file=None, feature_names=iris.feature_names,
                          class_names=iris.target_names, filled=True, rounded=True, special_characters=True)

#dot_data = export_graphviz(iris_tree, out_file=None, feature_names=['petal length', 'petal width'],
#                          class_names=iris.target_names, filled=True, rounded=True, special_characters=True)


# Create the graph
graph = pydotplus.graph_from_dot_data(dot_data)
# Convert the graph to an image
Image(graph.create_png())

Out[28]:

[Figure: the decision tree rendered as a PNG image]
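
When Graphviz is not available, the tree can still be inspected as plain text. This is a minimal sketch that swaps the Graphviz/pydotplus route for scikit-learn's `export_text`; it retrains the same kind of tree (entropy criterion, depth 3) so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# One line per node: the split condition, or the predicted class at a leaf
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```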


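2. Random Forest

Random forest : an ensemble of decision trees

Ensemble : a technique that combines several machine learning models into a more powerful model
The main weakness of a decision tree is its tendency to overfit the training data. A random forest mitigates this by building many decision trees and averaging their results, which reduces the amount of overfitting.
It builds many decision trees from the training data and derives the result by majority vote among them, which makes it a high-accuracy algorithm.

[ Hint ] Only the modeling step of the earlier example needs to change

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)

Parameters
n_estimators : the number of trees to build for the random forest model
random_state : the random seed; fix the random_state value to get reproducible results.
However, the more trees the random forest has, the less the result varies with random_state.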
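
The "combine many trees" idea can be made concrete with a short sketch. It assumes scikit-learn's bundled Iris data rather than the CSV file used below. Note that scikit-learn's `RandomForestClassifier` averages the class probabilities of its trees and picks the most probable class, rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Class probabilities predicted by each individual tree: shape (trees, samples, classes)
tree_probas = np.stack([t.predict_proba(X_test) for t in forest.estimators_])

# Averaging over the trees reproduces the forest's own probability estimate
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[np.argmax(mean_proba, axis=1)]

print('matches forest.predict:', np.array_equal(manual_pred, forest.predict(X_test)))
```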

In [7]:

import pandas as pd
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

# (1) Read the data (using pandas)
csv = pd.read_csv('../data/iris/iris.csv')
print(csv[:10]) # the column names are kept separate from record (sample) 0

# (2) Separate the data and the labels (check the actual column names in iris.csv)
csv_data = csv[['sepal.length','sepal.width','petal.length','petal.width']]
csv_label = csv['variety']
print(csv_data.head())
print(csv_label.head())

# (3) Split into training data and test data
X_train, X_test, y_train, y_test = train_test_split(csv_data, csv_label)

# (4) Train a classification model
#     The model is chosen here; for example, clf = svm.SVC(gamma='auto') would select an SVM instead
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# (5) Predict
y_pred = clf.predict(X_test)

# (6) Evaluate
ac_score = metrics.accuracy_score(y_test, y_pred)
print('Accuracy : ', ac_score)

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
5           5.4          3.9           1.7          0.4  Setosa
6           4.6          3.4           1.4          0.3  Setosa
7           5.0          3.4           1.5          0.2  Setosa
8           4.4          2.9           1.4          0.2  Setosa
9           4.9          3.1           1.5          0.1  Setosa
   sepal.length  sepal.width  petal.length  petal.width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
0    Setosa
1    Setosa
2    Setosa
3    Setosa
4    Setosa
Name: variety, dtype: object
Accuracy :  0.9473684210526315

In [9]:

cl_report = metrics.classification_report(y_test, y_pred)
print(cl_report)

              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        12
  Versicolor       0.85      1.00      0.92        11
   Virginica       1.00      0.87      0.93        15

    accuracy                           0.95        38
   macro avg       0.95      0.96      0.95        38
weighted avg       0.96      0.95      0.95        38

[Reference] Data Science School - classification performance evaluation

https://datascienceschool.net/03%20machine%20learning/09.04%20%EB%B6%84%EB%A5%98%20%EC%84%B1%EB%8A%A5%ED%8F%89%EA%B0%80.html?highlight=%EB%B6%84%EB%A5%98%20%EC%84%B1%EB%8A%A5%ED%8F%89%EA%B0%80
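
The precision and recall columns of the report above can be reproduced by hand from a confusion matrix. The sketch below uses small made-up label vectors (not the Iris results) purely to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels, for illustration only
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred)      # rows: actual class, columns: predicted class

# precision = correct predictions of a class / all predictions of that class
precision = np.diag(cm) / cm.sum(axis=0)
# recall = correct predictions of a class / all actual members of that class
recall = np.diag(cm) / cm.sum(axis=1)

print(cm)
print('precision:', precision)
print('recall   :', recall)
```

The hand-computed values agree with `precision_score(..., average=None)` and `recall_score(..., average=None)`.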

3. Digit classification (1)

1. Handwritten digits

Original data
- Published in the UCI Machine Learning Repository
- http://archive.ics.uci.edu/ml/datasets > optical recognition of handwritten digits
- 5620 samples, each a 2-D array of 8x8 pixels

Handwritten digit data included in scikit-learn

  from sklearn.datasets import load_digits
  digits = load_digits()

Data structure

  digits.images - array of image data
  digits.target - which digit each sample represents (label)
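
A quick look at the shapes makes the structure clearer. The copy bundled with scikit-learn holds 1797 samples, fewer than the 5620 in the original UCI data, and each pixel is an integer intensity from 0 to 16.

```python
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.images.shape)   # (samples, 8, 8): one 8x8 array per digit
print(digits.data.shape)     # (samples, 64): the same pixels, already flattened
print(digits.target[:10])    # labels of the first ten samples
print(digits.images.min(), digits.images.max())   # pixel intensity range
```

`digits.data` is the flattened form that estimators expect, so reshaping `digits.images` by hand is optional.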

In [1]:

# Load the digits data and print the first sample
from sklearn.datasets import load_digits
digits = load_digits()

print(digits.images[0])
print(digits.target[0])

[[ 0.  0.  5. 13.  9.  1.  0.  0.]
 [ 0.  0. 13. 15. 10. 15.  5.  0.]
 [ 0.  3. 15.  2.  0. 11.  8.  0.]
 [ 0.  4. 12.  0.  0.  8.  8.  0.]
 [ 0.  5.  8.  0.  0.  9.  8.  0.]
 [ 0.  4. 11.  0.  1. 12.  7.  0.]
 [ 0.  2. 14.  5. 10. 12.  0.  0.]
 [ 0.  0.  6. 13. 10.  0.  0.  0.]]
0

In [2]:

# Display the image in grayscale
%matplotlib inline
import pylab as pl

pl.gray()
pl.imshow(digits.images[0]);

In [3]:

# Show the first 10 images
import matplotlib.pyplot as plt

for i in range(10):
    plt.subplot(2, 5, i+1)
    plt.imshow(digits.images[i])

2. Image machine learning - using LinearSVC

Training data - 80 %
Test data     - 20 %

In [4]:

from sklearn.model_selection import train_test_split
from sklearn import datasets, svm, metrics
from sklearn.metrics import accuracy_score

# Load the data --- (*1)
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.images       # each sample is an 8*8 2-D array
y = digits.target

# Flatten each 2-D array into a 1-D array --- (*2)
X = X.reshape((-1, 64))     # reshape((a, b)) yields a rows and b columns; with a = -1, a is inferred from b
print(X[0])

# Split the data into training and test sets --- (*3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train the model --- (*4)
model = svm.LinearSVC()
model.fit(X_train, y_train)

# Predict and print the accuracy --- (*5)
y_pred = model.predict(X_test)
print('Accuracy : ', accuracy_score(y_test, y_pred))

[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
 15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
  0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
  0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
Accuracy :  0.9472222222222222

C:\Users\kosmo_03\anaconda3\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "

[Result] The accuracy changes on every run because train_test_split is called without random_state. It usually falls between about 0.93 and 0.96.
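
To get the same number on every run, the random seeds can be fixed. The sketch below is one possible variation, not the notebook's code: the `run_once` helper, the seed value, the division by 16 and the larger `max_iter` are all assumptions. Scaling the pixels and raising `max_iter` are common ways to address the ConvergenceWarning shown above.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

digits = load_digits()
X = digits.data / 16.0    # pixel values are 0-16; scaling to 0-1 is an assumption added here
y = digits.target

def run_once(seed):
    # Hypothetical helper: one split / train / evaluate cycle with fixed seeds
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearSVC(max_iter=10000, random_state=seed)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

first = run_once(0)
second = run_once(0)
print(first, second)      # identical because both the split and the solver are seeded
```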

In [ ]:

# Compare the predicted values with the actual digits

[ Reference ] Python built-in functions : https://wikidocs.net/32

* zip() and list()

list(zip([1, 2, 3], [4, 5, 6])) -> [(1, 4), (2, 5), (3, 6)]

list(zip([1, 2, 3], [4, 5, 6], [7, 8, 9])) -> [(1, 4, 7), (2, 5, 8), (3, 6, 9)]

list(zip("abc", "def")) -> [('a', 'd'), ('b', 'e'), ('c', 'f')]
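
The same `zip()` pattern is a convenient way to compare predictions with the actual labels. The two lists below are made-up stand-ins for `y_test` and `y_pred`.

```python
# Made-up labels standing in for y_test and y_pred
y_test = [3, 5, 8, 1]
y_pred = [3, 6, 8, 7]

# Pair each actual label with its prediction and keep the positions that differ
mismatches = [(i, actual, predicted)
              for i, (actual, predicted) in enumerate(zip(y_test, y_pred))
              if actual != predicted]

print(mismatches)   # (index, actual, predicted) for every wrong prediction
```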

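3. Classifying a hand-drawn image

- In Paint, create a square canvas (equal width and height in pixels), draw a digit with a thick black line, and save it as 'my.png'
  ( make your own handwritten image  ( 200px * 200px ) )

- Use OpenCV to convert the image into pixel data

OpenCV
- https://opencv.org

[ Reference ] OpenCV setup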

In [5]:

# Install OpenCV
!pip install opencv-python

Collecting opencv-python
  Downloading opencv_python-4.5.1.48-cp38-cp38-win_amd64.whl (34.9 MB)
Requirement already satisfied: numpy>=1.17.3 in c:\users\kosmo_03\anaconda3\lib\site-packages (from opencv-python) (1.19.2)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.5.1.48

In [7]:

import cv2

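def predict_digit(filename):

    # Read the hand-drawn digit image
    my_img = cv2.imread(filename, cv2.IMREAD_COLOR)
    if my_img is None:
        # cv2.imread returns None when the file is missing or unreadable
        raise FileNotFoundError(filename)

    # Convert the image into the format the model was trained on
    my_img = cv2.cvtColor(my_img, cv2.COLOR_BGR2GRAY)
    my_img = cv2.resize(my_img, (8, 8))
    my_img = 15 - my_img // 16 # invert black/white and map 0-255 to 15-0
    # Flatten the 2-D array into a 1-D array
    my_img = my_img.reshape((-1, 64))
    # Predict the digit
    res = model.predict(my_img)
    return res[0]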

In [11]:

# Run with a specific image file
n = predict_digit('img/my5.png')
print('Result', str(n))

Result 7
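
The preprocessing inside predict_digit can be followed step by step without an image file. This sketch uses NumPy only and a synthetic 200x200 "drawing"; block averaging stands in for `cv2.resize`, and the result is rescaled to the 0-16 range of `load_digits` rather than the 0-15 range the function above produces. Both choices are assumptions for illustration.

```python
import numpy as np

# Synthetic 200x200 grayscale drawing: white background (255), black vertical bar (0)
img = np.full((200, 200), 255, dtype=np.uint8)
img[:, 75:125] = 0

# Shrink to 8x8 by averaging each 25x25 block of pixels
small = img.reshape(8, 25, 8, 25).mean(axis=(1, 3))

# Invert (dark ink becomes a high value) and rescale 0-255 to 0-16
scaled = (255.0 - small) / 255.0 * 16.0

# Flatten to the (1, 64) shape that model.predict expects
sample = scaled.reshape(1, 64)
print(scaled.round(1))
```

How closely a hand-drawn image matches the training data (stroke thickness, centering, intensity range) may affect the prediction, which is one possible reason a drawn digit is misclassified.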

In [ ]:

[Reference] Classification with a Gaussian naive Bayes classifier

Only compare this with the LinearSVC example above.

In [12]:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
import pylab as pl

In [13]:

# Load the digits data
digits = load_digits()

# Target variable
y = digits.target

# Prepare the data
n_sample = len(digits.images)
X = digits.images.reshape((n_sample, -1))
    # reshape() : flattens each 8*8 matrix into a 64-element vector
print(X)

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]

In [14]:

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print('train : ', len(X_train), ', test : ' , len(X_test))

train :  1347 , test :  450

In [15]:

# Choose the naive Bayes classifier - estimates probabilities with a Gaussian distribution
gnb = GaussianNB()
fit = gnb.fit(X_train, y_train)  # fit the model to the data

# Predict
predicted = fit.predict(X_test)

# Build the confusion matrix
#   : a 2-D array showing how often predictions were confused (wrong);
#     rows are the actual digits and columns are the predicted digits
print(confusion_matrix(y_test, predicted))

# [Result] A 2 was predicted where the actual digit was 8 in 15 cases,
#          and an 8 was predicted where the actual digit was 2 in 5 cases.

In [16]:

# Compare the predicted values with the actual digits
images_and_predictions = list(zip(digits.images, fit.predict(X)))
for index, (image, prediction) in enumerate(images_and_predictions[:10]):
    pl.subplot(5, 3, index+5 )  # 5x3 grid of subplots, filled starting from position 5
    pl.axis('off')  # hide the axes
    pl.imshow(image, cmap=pl.cm.gray_r, interpolation='nearest')
    pl.title('prediction: %i' % prediction )

pl.show()  # display the whole figure filled with the subplots

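[Result] An 8 was predicted where the digit is a 2, and a 3 was predicted where the digit is probably a 5.

- This example is only a sample for checking the digits visually.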
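
Which digits the Gaussian naive Bayes model confuses most often can also be read from the confusion matrix programmatically. The sketch below retrains the model on the same `random_state=0` split; the `errors`, `actual` and `guessed` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)
predicted = GaussianNB().fit(X_train, y_train).predict(X_test)

cm = confusion_matrix(y_test, predicted)   # rows: actual digit, columns: predicted digit

errors = cm.copy()
np.fill_diagonal(errors, 0)                # drop the correct predictions on the diagonal
actual, guessed = np.unravel_index(np.argmax(errors), errors.shape)

print(cm)
print('most frequent error: actual', actual, 'predicted as', guessed,
      '(', errors[actual, guessed], 'times )')
```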

4. Digit classification (2)

Recognizing handwritten digits

- Uses the data published by MNIST
- http://yann.lecun.com/exdb/mnist
- Consists of 60,000 training samples and 10,000 test samples ( 70,000 samples in total )

train-images-idx3-ubyte.gz:  training set images (9912422 bytes)
train-labels-idx1-ubyte.gz:  training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz:   test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz:   test set labels (4542 bytes)

These 4 files store a large number of handwritten digit images in their own binary format (IDX).
The format has to be parsed before the images and labels can be used.
First download the 4 compressed files and decompress them with Gzip.

train-images-idx3-ubyte.gz : stores the 60,000 images
train-labels-idx1-ubyte.gz : stores which digit each image represents

[Result] The structure of the 4 decompressed files is documented on the site above; look it up by file name.

[Task] Create the data folder and the mnist folder in advance
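
The layout of an image file in this format can be tried out on a tiny in-memory example before touching the real downloads. The sketch builds a made-up file with 2 images of 2x3 pixels. The header consists of four big-endian 32-bit integers (magic number, image count, rows, columns), and 2051 is the magic number MNIST uses for image files.

```python
import struct

# Build a tiny IDX image file in memory: magic 2051, 2 images, 2 rows, 3 columns
header = struct.pack(">IIII", 2051, 2, 2, 3)
pixels = bytes(range(12))            # 2 * 2 * 3 made-up pixel values
blob = header + pixels

# Parse it back: a 16-byte header followed by one unsigned byte per pixel
magic, count, rows, cols = struct.unpack(">IIII", blob[:16])
size = rows * cols
images = [list(blob[16 + i * size: 16 + (i + 1) * size]) for i in range(count)]

print(magic, count, rows, cols)
print(images)
```

Label files follow the same idea with a shorter header (magic number and item count only), followed by one byte per label.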

1. Downloading and decompressing the data

In [1]:

import urllib.request as req
import gzip, os, os.path
savepath = "./mnist"
baseurl = "http://yann.lecun.com/exdb/mnist"
files = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz"]

# Download
if not os.path.exists(savepath): os.mkdir(savepath)
for f in files:
    url = baseurl + "/" + f
    loc = savepath + "/" + f
    print("download:", url)
    if not os.path.exists(loc):
        req.urlretrieve(url, loc)

# Decompress the GZip files
for f in files:
    gz_file = savepath + "/" + f
    raw_file = savepath + "/" + f.replace(".gz", "")
    print("gzip:", f)
    with gzip.open(gz_file, "rb") as fp:
        body = fp.read()
        with open(raw_file, "wb") as w:
            w.write(body)
print("ok")

download: http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
download: http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
download: http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
download: http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
gzip: train-images-idx3-ubyte.gz
gzip: train-labels-idx1-ubyte.gz
gzip: t10k-images-idx3-ubyte.gz
gzip: t10k-labels-idx1-ubyte.gz
ok

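2. Converting the binary files to CSV files

 The decompressed files are binary, which makes them hard to work with.
 These binary files will be converted to CSV files, which are easier to handle.

 [ CSV file layout ]
 label, 28 * 28 pixel values
 5, 0,0,0,0,0,0, 30, 80, 100, . . .

 [Result] train.csv (training data) and t10k.csv (test data) are created, and
 the first 10 images of each are also saved as image files (pgm) for checking.
 The pgm files can be opened with any image viewer that supports the format.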

In [2]:

import os
import struct

def to_csv(name, maxdata):
    # Create the output folder if it does not exist yet
    os.makedirs("./data", exist_ok=True)
    # Open the label file, the image file and the output CSV file
    with open("./mnist/"+name+"-labels-idx1-ubyte", "rb") as lbl_f, \
         open("./mnist/"+name+"-images-idx3-ubyte", "rb") as img_f, \
         open("./data/"+name+".csv", "w", encoding="utf-8") as csv_f:

        # Read the header information ---
        # each header field is a 4-byte big-endian unsigned integer
        mag, lbl_count = struct.unpack(">II", lbl_f.read(8))
        mag, img_count = struct.unpack(">II", img_f.read(8))
        rows, cols = struct.unpack(">II", img_f.read(8))
        pixels = rows * cols
        # Read the image data and save it as CSV ---
        for idx in range(lbl_count):
            if idx >= maxdata: break   # stop after maxdata samples
            # Read one label byte and one image's worth of pixel bytes
            label = struct.unpack("B", lbl_f.read(1))[0]
            bdata = img_f.read(pixels)
            sdata = list(map(str, bdata))
            csv_f.write(str(label)+",")
            csv_f.write(",".join(sdata)+"\n")
            # Save the first 10 images as PGM files to check the conversion --
            if idx < 10:
                s = "P2 28 28 255\n"
                s += " ".join(sdata)
                iname = "./data/{0}-{1}-{2}.pgm".format(name,idx,label)
                with open(iname, "w", encoding="utf-8") as f:
                    f.write(s)
    

In [5]:

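# ----------------------------
# Write the results to files
# Processing all 60,000 samples takes a long time, so save 1000 training samples and 500 test samples to CSV.
to_csv('train', 1000)      # change to 60000 for a full run
to_csv('t10k', 500)        # change to 10000 for a full run
# the files are created in the data folder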

In [11]:

to_csv('train', 60000)  
to_csv('t10k', 10000)

3. Reading the CSV files and preparing the training and test data

In [12]:

from sklearn import model_selection, svm, metrics

"""
 Read and process the CSV file
     [ CSV file layout ]
     label, 28 * 28 pixel values
     5, 0,0,0,0,0,0, 30, 80, 100, . . .
"""

def load_csv(fname):
    labels = []
    images = []
    with open(fname, "r") as f:
        for line in f:
            cols = line.split(",")
            if len(cols) < 2: continue
            labels.append(int(cols.pop(0)))
            # Each pixel is an integer from 0 to 255; divide by 256 to scale it to a float in [0, 1)
            # About lambda : https://wikidocs.net/64
            vals = list(map(lambda n: int(n) / 256, cols))
            images.append(vals)
    return {"labels":labels, "images":images}

data = load_csv("./data/train.csv")
test = load_csv("./data/t10k.csv")
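
The same "label, pixels..." layout can also be read with NumPy instead of a hand-written loop. This is a sketch of an alternative, not the method used in this post: it parses two made-up rows (4 pixels each, for brevity) from an in-memory string and applies the same division by 256.

```python
import io
import numpy as np

# Two made-up rows in the "label, pixel, pixel, ..." layout
csv_text = "5,0,128,255,64\n0,255,0,0,32\n"

# ndmin=2 keeps the result 2-D even if the file holds a single row
table = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.int64, ndmin=2)

labels = table[:, 0]           # first column: the digit label
images = table[:, 1:] / 256    # remaining columns: pixels scaled to [0, 1)

print(labels)
print(images)
```

For the real files, a path such as "./data/train.csv" would replace the in-memory string.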

4. Training the model (SVC)

In [13]:

clf = svm.SVC()
clf.fit(data['images'], data['labels'])

Out[13]:

SVC()

5. Predicting and checking the result

In [14]:

y_pred = clf.predict(test['images'])

ac_score = metrics.accuracy_score(test['labels'], y_pred)
print(ac_score)

0.9792

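[Result] Accuracy is 0.9792 (about 97.9%) when trained on the full 60,000 training samples and tested on the 10,000 test samples.

[Re-check]
to_csv("train", 10000)
to_csv("t10k", 5000)

Extracting more data and running again makes training much slower,
but the accuracy comes out above 90%.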
