Proglam: ML


Thursday, August 1, 2019

Convolution vs Cross-correlation

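References

https://tensorflow.blog/2017/12/21/convolution-vs-cross-correlation/

Andrew Ng CNN

This post is my personal summary, based on the lecture and blog post above.

Convolution VS Cross-correlation

The convolution operation used in a CNN (Convolutional Neural Network) is not the same as the convolution operation defined in mathematics.

What a CNN calls convolution is actually cross-correlation: the kernel is laid over a patch of the input, the values at matching positions are multiplied, and the products are summed to give one output value.

Mathematical convolution differs in one step: the kernel is first flipped both horizontally and vertically (a 180-degree rotation), and only then are the values multiplied and summed.

Because the kernel weights in a CNN are learned, skipping the flip makes no practical difference to the result, so the flip is omitted. In machine learning the network is nevertheless conventionally called a convolutional neural network.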
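
To make the difference concrete, here is a minimal NumPy sketch. The function names, the 4x4 input, and the 2x2 kernel are hypothetical and chosen only for illustration; the loops implement "valid" sliding with no padding or stride.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image, multiply element-wise, and sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Mathematical convolution: flip the kernel on both axes, then cross-correlate."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

# Hypothetical input and kernel, for illustration only.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])

cc = cross_correlate2d(image, kernel)
cv = convolve2d(image, kernel)
print(cc[0, 0], cv[0, 0])
```

For a kernel that is symmetric under the flip (for example, all ones), the two operations return identical outputs, which is one way to see why the distinction rarely matters in practice.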

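Tuesday, April 9, 2019

[Seminar Presentation] Logistic Regression Classification

References

The post below is reworked from the 모두의 딥러닝 (Deep Learning for Everyone) lectures.

What is Classification?

Classification is a kind of supervised learning. The learner is given data that has already been labeled with categories and uses it to learn the relationship between the data and those categories.

Once training on the given data is complete, the model can assign new, unseen data to a category.

Spam Email Detection: Spam (1) or Ham (0)
Facebook feed: Show (1) or hide (0)
Credit Card Fraudulent Transaction Detection: Legitimate (1) or Fraud (0)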

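Logistic Regression Classification

How do we classify?

So how does Logistic Regression Classification split data into groups? Let's start by thinking it through with the simplest model, a linear model.

Suppose we are given the data shown above and want to classify it. With a linear model, we would expect a fit like the following.

Because the data has to be split into two groups, we pick 0.5, the value that halves the y axis, as the threshold; this gives the x value that separates the two groups. With this model we can now decide from a data point's hours value whether it belongs to group 0 or group 1.

At first glance the model looks complete, but a linear model has limits. Consider what happens when a new data point arrives.

The model trained so far does not fit the new data, so we have to retrain a model that accommodates it.

The retrained model covers the new data, but a problem appears: some data points that used to belong to group 1 no longer do.

The example is somewhat contrived, but it shows the limits of a linear model well. Beyond this problem, a linear model's output is unbounded, so its predictions can fall outside the 0 to 1 range of the labels, which causes further trouble.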
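
The threshold shift can be reproduced with a small sketch. The hours and labels below are hypothetical stand-ins for the missing figures, and `linear_threshold` is a helper name invented here; it fits a least-squares line with `np.polyfit` and solves for the x where the line crosses 0.5.

```python
import numpy as np

def linear_threshold(hours, labels):
    """Fit y = slope * x + intercept and return the x where y equals 0.5."""
    slope, intercept = np.polyfit(hours, labels, 1)
    return (0.5 - intercept) / slope

# Hypothetical training data: study hours and pass (1) / fail (0) labels.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
before = linear_threshold(hours, labels)

# A new, extreme data point: 50 hours, label 1.
hours_new = np.append(hours, 50.0)
labels_new = np.append(labels, 1.0)
after = linear_threshold(hours_new, labels_new)

print(f"threshold before: {before:.2f} hours")
print(f"threshold after:  {after:.2f} hours")
```

With these numbers the threshold moves from 3.5 to roughly 4.67 hours, so the point at 4 hours, which was labeled 1, now lands on the group 0 side of the line.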

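Logistic Regression

One way to compensate for these limits of the linear model is logistic regression.

Logistic regression is a statistical technique and a kind of probabilistic model (this blog was helpful for the details of the logistic function), and it has the shape shown in the figure below.

Logistic Regression Classification

Logistic Regression Classification passes the output of a linear model through the logistic function, which makes the model non-linear and compensates for the limits of the linear model.

It would be nice if adding the logistic function to the linear model were all that is needed, but doing only that causes a problem.

Machine learning is the process of finding the weight values that best solve a problem. That makes it essential to compute a cost (the error) and to use that cost to update the weights.

So even after the logistic function is added to the linear model, we still have to compute the cost and update the weights. The cost function used in linear regression looks like the figure below.

The advantage of this graph is that it is convex. With a convex function, the minimum can be found no matter where the search starts, which makes it very convenient.

The cost function used in linear regression is convex, but once the logistic function is part of the model, the same cost function no longer keeps its convex shape.

For that reason, logistic classification uses a modified cost function.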
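
The post does not spell out the modified cost. The usual choice is the log loss (cross-entropy), and the sketch below assumes it. The data, learning rate, and step count are hypothetical, and the function names are made up for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    """Cross-entropy cost, assumed here as the modified cost function."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def train(X, y, lr=0.1, steps=20000):
    """Plain batch gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

# Hypothetical data: study hours and pass (1) / fail (0) labels.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(hours), hours])  # bias column plus feature

w = train(X, y)
pred = (sigmoid(X @ w) >= 0.5).astype(float)
print("weights:", w, "cost:", log_loss(w, X, y))
```

Because the sigmoid output always stays between 0 and 1, a far-away new point cannot drag the 0.5 threshold the way it does with a straight line.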

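Thursday, March 7, 2019

[Machine Learning in Action] 1.1 Machine Learning Basics

Overview

This is a personal summary, so I omit and add material as I see fit.

"Machine learning is about making sense of data."

Machine learning algorithms

General-purpose algorithms for performing classification, regression, clustering, and density estimation

I used to be confused about how supervised and unsupervised learning are distinguished, and about whether classification and regression should be defined as goals of learning or as methods of learning. The book lays this out clearly, which I liked.

The book defines machine learning as making sense of data. In other words, machine learning is a way of processing data, and supervised and unsupervised learning are the two broad ways of doing that processing.
Likewise, supervised methods such as classification, regression, and k-nearest neighbors organize data through supervised learning (attaching labels to data); what differs is the shape the data is put into.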

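How to choose the right algorithm

The algorithms described above can all be seen as different ways of solving the same problem. That raises a question: "If they all aim to do the same thing, why are there four different methods? Why can't we just pick one?"

The answer is that we do not know in advance which method best fits the problem we want to solve, and we want to choose the best fit. So how should we choose among the algorithms?

(To keep the answers short, each question below is answered with YES/NO.)

Consider your goal.

Are you trying to predict or forecast a target value? Yes: supervised learning / No: unsupervised learning
If supervised, what is the target value? Is it discrete? Yes: classification
If supervised, what is the target value? Is it numeric? Yes: regression
If unsupervised, are you trying to find which discrete group each item belongs to? Yes: clustering
If unsupervised, do you want a numeric estimate of how strongly each item belongs to each group? Yes: density estimation

Consider the data you have.

Are the features nominal or continuous?
Are there missing values in the features?
If there are missing values, why are they missing?
Are there erroneous values (outliers) in the data?
Is there something that occurs only very rarely?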

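Common steps in machine learning

Every machine learning application has to go through a common set of steps when it is built.

1. Collect data
2. Prepare the input data
Put the data you have into a usable format. The format depends on the algorithm, so this step produces an algorithm-specific format.
Some algorithms require features in a special format, some can handle target variables and features as strings, and others need the features to be integers.
3. Analyze the input data
Look carefully at the data produced by the previous steps.
4. Train the algorithm
This is where the machine learning takes place. This step and the next deal with the core algorithm. The clean data from the first two steps is fed to the algorithm to extract knowledge or information. That knowledge is often stored in a form that a machine can readily use in the next two steps.
In unsupervised learning there is no target value, so there is no training step.
5. Test the algorithm
This step puts the information learned in the previous step to use: the learned information is tested to see how well training went.
In supervised learning you have some known values that can be used to evaluate the algorithm. In unsupervised learning you will have to use other metrics to evaluate success.

If the results are not satisfactory in either case, you can go back to step 4, change a few things, and test again. The problem may also lie in how the data was collected and prepared, in which case you have to go back to step 1.
6. Use it
Turn the result into a working program that does some task, and check again that all the previous steps worked as expected. When new data comes in, steps 1-5 may need to be revisited.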

[ML] Jensen-Shannon Divergence, Kullback-Leibler Divergence

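References

Wednesday, March 6, 2019

L1, L2 Regularization

Source

* Needs more detail

Definition of Regularization

Feature Cross

A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing features can give better predictive performance than the individual features provide on their own.

Overfitting

Overfitting happens when a model built from training data adapts too closely to that training data.

An overfitted model fails to respond properly to data other than its training data. It mainly occurs in the following cases.

The model has many parameters and high expressive power
There is little training data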

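Regularization

Creating feature crosses to improve performance adds more dimensions. As the number of dimensions grows, the model grows too, and it needs a huge amount of RAM. A larger model also brings the overfitting problem.

So how can these problems be solved?

Regularization is one way to solve them.

It is a way of making an over-fitted model generalize.

Regularization methods

How can we reduce the model size and RAM usage?

By setting some weights to 0, which reduces the number of features. Removing arbitrary features would hurt the model's performance, so the right features to remove are the noisy ones that degrade it.

Regularization methods can be distinguished by whether they set weights to exactly 0 or merely shrink them.

The first approach, usually called L0 regularization, sets weights to 0 and removes those features entirely. Its cost function is not convex, so the minimizing values are hard to find.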

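L1, Lasso Regression

L1 regularization adds the L1 norm of the weights (the sum of the absolute values of the weights) to the model's cost function as a penalty. In a model that relies on sparse features, where most values are 0, L1 regularization drives the weights of unnecessary features to exactly 0 so that the model ignores them. This makes it effective for feature selection. L1 regularization can be expressed with the formula below.

Rather than shrinking the weight in proportion to its value, the update subtracts a constant whose sign follows the sign of w.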
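
One common way to write the L1 penalty and the resulting gradient descent update is sketched below. C_0 is the unregularized cost mentioned in these notes; the regularization strength λ, the training-set size n, and the learning rate η are assumed symbols that do not appear in the post.

```latex
C = C_0 + \frac{\lambda}{n} \sum_{w} |w|
\qquad\Longrightarrow\qquad
w \leftarrow w - \frac{\eta \lambda}{n}\,\operatorname{sgn}(w) - \eta \frac{\partial C_0}{\partial w}
```

The term with sgn(w) is the constant that gets subtracted: its size does not depend on how large w is, only on its sign.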

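L2, Ridge Regression

L2 regularization adds the squared L2 norm of the weights (the sum of the squares of the weights) to the model's cost function as a penalty. L2 regularization can be expressed with the formula below.

Training then does not simply move in the direction that reduces C_0; it also moves in the direction that keeps the w values small.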
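
A common way to write the L2 penalty and its update, again using C_0 from these notes and assuming λ for the regularization strength, n for the training-set size, and η for the learning rate (none of which appear in the post):

```latex
C = C_0 + \frac{\lambda}{2n} \sum_{w} w^2
\qquad\Longrightarrow\qquad
w \leftarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}
```

Here the weight is multiplied by a factor slightly below 1 at every step, so large weights shrink faster than small ones and weights approach 0 without usually reaching it exactly.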

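Differences between L1 and L2 Regularization and how to choose

Because L1 regularization subtracts a constant at every step, small weights are driven to 0 and only a few important weights remain.

So when you want to pull out a few meaningful values, L1 regularization is effective, which makes it a good fit for sparse models (sparse coding). However, the penalty is not differentiable at 0, so care is needed when applying it to gradient-based learning.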
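
The contrast can be seen by applying only the penalty part of each update to a few weights. This is a rough sketch: the weights, the step sizes, and the helper names are hypothetical, and the L1 step clamps at 0 (soft thresholding) as one assumed way of handling the non-differentiable point.

```python
import numpy as np

def l1_step(w, shrink):
    """Move each weight toward 0 by a constant amount, clamping at exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - shrink, 0.0)

def l2_step(w, decay):
    """Shrink each weight in proportion to its current value."""
    return (1.0 - decay) * w

# Hypothetical weights: two large, two small.
w = np.array([2.0, -0.05, 0.03, -1.5])
w_l1, w_l2 = w.copy(), w.copy()
for _ in range(10):
    w_l1 = l1_step(w_l1, 0.01)
    w_l2 = l2_step(w_l2, 0.01)

print("after L1 steps:", w_l1)
print("after L2 steps:", w_l2)
```

After ten steps the two small weights are exactly 0 under the L1-style update, while under the L2-style update every weight is smaller but still non-zero.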

Dropout

Dropout is known to be complementary to other regularization techniques (L1, L2, maxnorm). In each layer, a fixed fraction of the neurons is chosen at random and dropped, so that only the remaining neurons are trained.
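
A minimal sketch of the masking step, assuming the "inverted dropout" variant in which the surviving activations are rescaled so their expected value is unchanged. The rescaling, the 0.5 drop rate, and the array shape are assumptions that go beyond what these notes describe.

```python
import numpy as np

def dropout(activations, drop_rate, rng):
    """Zero a random subset of activations and rescale the survivors."""
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob  # True where a neuron is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
layer_output = np.ones((4, 1000))  # hypothetical activations of one layer
dropped = dropout(layer_output, drop_rate=0.5, rng=rng)

print("fraction dropped:", np.mean(dropped == 0.0))
```

The mask is redrawn on every training step, and dropout is switched off at prediction time.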

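Tuesday, January 22, 2019

[ML] Week 2 (1)

Summary

* I think I need to study derivatives. They don't quite click yet.
* Up to now, most of the material was something I had already fully understood from the 모두의 딥러닝 lectures, so I got through it without difficulty. From week 2 on there is more new material, and the math is explained in more detail than in 모두의 딥러닝, so I will need to review.

Subtopics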

Multiple Features

Note: [7:25 - θ^T is a 1 by (n+1) matrix and not an (n+1) by 1 matrix]

Linear regression with multiple variables is also known as "multivariate linear regression".

We now introduce notation for equations where we can have any number of input variables.

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

h_θ(x) = θ0 + θ1 x_1 + θ2 x_2 + θ3 x_3 + ... + θn x_n

In order to develop intuition about this function, we can think about θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc. x_1 will be the number of square meters in the house, x_2 the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

h_θ(x) = [θ0 θ1 ... θn] · [x_0, x_1, ..., x_n]^T = θ^T x

This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.

Remark: Note that for convenience reasons in this course we assume x_{0}^{(i)} = 1 for i ∈ {1, ..., m}. This allows us to do matrix operations with θ and x, making the two vectors θ and x^{(i)} match each other element-wise (that is, have the same number of elements: n+1).

Gradient Descent For Multiple Variables

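The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:

repeat until convergence: {
θ0 := θ0 - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x_0^{(i)}
θ1 := θ1 - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x_1^{(i)}
θ2 := θ2 - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x_2^{(i)}
...
}

In other words:

repeat until convergence: {
θj := θj - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x_j^{(i)}    for j := 0...n
}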

The following image compares gradient descent with one variable to gradient descent with multiple variables:
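
The update for all parameters can be written as a single vectorized step. The sketch below assumes the convention x_0 = 1 from the remark above; the data is hypothetical (generated from y = 1 + 2*x_1 + 3*x_2), and the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression with any number of features.

    X is expected to already contain the x_0 = 1 column.
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for every i
        gradient = X.T @ errors / m       # one partial derivative per theta_j
        theta = theta - alpha * gradient  # all theta_j updated simultaneously
    return theta

# Hypothetical data generated from y = 1 + 2*x_1 + 3*x_2.
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]
X = np.column_stack([np.ones(len(features)), features])

theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
print(theta)
```

Computing the whole gradient before touching theta is what makes the update simultaneous across all j.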

Gradient Descent in Practice I - Feature Scaling

Note: [6:20 - The average size of a house is 1000 but 100 is accidentally written instead]

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

-1 ≤ x_i ≤ 1

or

-0.5 ≤ x_i ≤ 0.5

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

x_i := (x_i - μ_i) / s_i

Where μ_i is the average of all the values for feature (i) and s_i is the range of values (max - min), or s_i is the standard deviation.

Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation.

For example, if x_i represents housing prices with a range of 100 to 2000 and a mean value of 1000, then x_i := (price - 1000) / 1900.
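
Both choices of s_i can be sketched in a few lines of NumPy. The three prices are hypothetical, picked so that the minimum is 100, the maximum is 2000, and the mean is 1000; `mean_normalize` is a helper name invented for this example.

```python
import numpy as np

def mean_normalize(values, use_std=False):
    """Subtract the mean, then divide by the range or by the standard deviation."""
    mu = values.mean()
    s = values.std() if use_std else values.max() - values.min()
    return (values - mu) / s

# Hypothetical prices: min 100, max 2000, mean 1000.
prices = np.array([100.0, 900.0, 2000.0])

by_range = mean_normalize(prices)
by_std = mean_normalize(prices, use_std=True)
print(by_range)
print(by_std)
```

Dividing by the range leaves values spanning exactly 1, while dividing by the standard deviation gives a spread of 1 in the statistical sense, so the two results differ.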

Gradient Descent in Practice II - Learning Rate

Note: [5:20 - the x -axis label in the right graph should be θ rather than No. of iterations ]

Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.

Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10^(−3). However in practice it's difficult to choose this threshold value.

It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

To summarize:

If α is too small: slow convergence.

If α is too large: may not decrease on every iteration and thus may not converge.
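
These checks can be tried on a toy problem. The sketch assumes the hypothetical cost J(θ) = θ^2, whose derivative is 2θ; the two learning rates and the helper names are illustrative only.

```python
def cost_history(alpha, steps=20, theta=1.0):
    """Record J(theta) = theta**2 before each gradient descent step."""
    history = []
    for _ in range(steps):
        history.append(theta ** 2)
        theta -= alpha * 2.0 * theta  # derivative of theta**2 is 2*theta
    return history

def ever_increases(history):
    """Debugging check: did the cost go up between any two iterations?"""
    return any(later > earlier for earlier, later in zip(history, history[1:]))

def converged(history, epsilon=1e-3):
    """Automatic convergence test on the last iteration."""
    return 0.0 <= history[-2] - history[-1] < epsilon

small = cost_history(alpha=0.1)
large = cost_history(alpha=1.1)
print(ever_increases(small), converged(small))
print(ever_increases(large), converged(large))
```

With α = 0.1 the cost falls on every iteration and passes the convergence test; with α = 1.1 each step overshoots the minimum and the cost grows, which is the signal to decrease α.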

Features and Polynomial Regression

We can improve our features and the form of our hypothesis function in a couple different ways.

We can combine multiple features into one. For example, we can combine x_1 and x_2 into a new feature x_3 by taking x_1*x_2.

Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

For example, if our hypothesis function is hθ(x) = θ0 + θ1x_1 then we can create additional features based on x_1, to get the quadratic function hθ(x) = θ0 + θ1x_1 + θ2(x_1)^2 or the cubic function hθ(x) = θ0 + θ1x_1 + θ2(x_1)^2 + θ3(x_1)^3

In the cubic version, we have created new features x_2 and x_3 where x_2=(x_1)^2 and x_3=(x_1)^3.

To make it a square root function, we could do: hθ(x) = θ0 + θ1x_1 + θ2sqrt(x_1)

One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.

eg. if x_1 has range 1 - 1000 then range of x_1^2 becomes 1 - 1000000 and that of x_1^3 becomes 1 - 1000000000
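
A short sketch of building the polynomial features and bringing them back to comparable ranges. The input values are hypothetical, and scaling by the range is one of the two options from the feature scaling notes; the function names are made up for this example.

```python
import numpy as np

def polynomial_features(x1, degree):
    """Stack x1, x1**2, ..., x1**degree as columns."""
    return np.column_stack([x1 ** p for p in range(1, degree + 1)])

def scale_columns(features):
    """Mean-normalize each column and divide by its range."""
    mu = features.mean(axis=0)
    span = features.max(axis=0) - features.min(axis=0)
    return (features - mu) / span

x1 = np.linspace(1.0, 1000.0, 50)  # hypothetical feature with range 1 - 1000
raw = polynomial_features(x1, degree=3)
scaled = scale_columns(raw)

print("raw maxima:", raw.max(axis=0))
print("scaled ranges:", scaled.max(axis=0) - scaled.min(axis=0))
```

Before scaling, the cubic column is a million times larger than the linear one; after scaling, every column spans a range of 1.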

Monday, January 21, 2019

[ML] Week 1 (3)

Summary

Subtopics

Gradient Descent

So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.

Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(\theta_0,\theta_1). Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.

The gradient descent algorithm is:

repeat until convergence:

θj := θj - α (∂/∂θj) J(θ0, θ1)

where

j=0,1 represents the feature index number.

At each iteration, one should simultaneously update all of the parameters θj. Updating a specific parameter before computing the update for another one in the same iteration leads to a wrong implementation.
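
The difference between a simultaneous and a sequential update shows up after a single step. The sketch assumes a hypothetical cost J(θ0, θ1) = θ0^2 + θ0*θ1 + θ1^2, chosen only because each partial derivative depends on both parameters.

```python
def grad(theta0, theta1):
    """Partial derivatives of the hypothetical J = t0**2 + t0*t1 + t1**2."""
    return 2.0 * theta0 + theta1, theta0 + 2.0 * theta1

def simultaneous_step(theta0, theta1, alpha):
    """Correct: compute both partial derivatives first, then update both."""
    d0, d1 = grad(theta0, theta1)
    return theta0 - alpha * d0, theta1 - alpha * d1

def sequential_step(theta0, theta1, alpha):
    """Wrong: theta0 is overwritten before theta1's derivative is computed."""
    theta0 = theta0 - alpha * grad(theta0, theta1)[0]
    theta1 = theta1 - alpha * grad(theta0, theta1)[1]
    return theta0, theta1

correct = simultaneous_step(1.0, 1.0, alpha=0.1)
wrong = sequential_step(1.0, 1.0, alpha=0.1)
print(correct, wrong)
```

Starting from (1, 1), the simultaneous step moves both parameters to 0.7, while the sequential version sends θ1 to about 0.73 because it already sees the updated θ0.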

Gradient Descent Intuition

In this video we explored the scenario where we used one parameter θ1 and plotted its cost function to implement a gradient descent. Our formula for a single parameter was :

Repeat until convergence:

θ1 := θ1 - α (d/dθ1) J(θ1)

Regardless of the slope's sign for (d/dθ1) J(θ1), θ1 eventually converges to its minimum value. The following graph shows that when the slope is negative, the value of θ1 increases and when it is positive, the value of θ1 decreases.

On a side note, we should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge or too much time to obtain the minimum value imply that our step size is wrong.

How does gradient descent converge with a fixed step size α?

The intuition behind the convergence is that (d/dθ1) J(θ1) approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

θ1 := θ1 - α · 0

Gradient Descent For Linear Regression

Note: [At 6:15 "h(x) = -900 - 0.1x" should be "h(x) = 900 - 0.1x"]

When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:

repeat until convergence: {
θ0 := θ0 - α (1/m) Σ_{i=1}^{m} (h_θ(x_i) - y_i)
θ1 := θ1 - α (1/m) Σ_{i=1}^{m} ((h_θ(x_i) - y_i) x_i)
}

where m is the size of the training set, θ0 is a constant that will be changing simultaneously with θ1, and x_i, y_i are values of the given training set (data).

Note that we have separated out the two cases for θj into separate equations for θ0 and θ1; and that for θ1 we are multiplying x_i at the end due to the derivative. The following is a derivation of (∂/∂θj) J(θ) for a single example:
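
For a single training example (x, y), the derivation presumably follows from the chain rule as sketched here. It assumes the squared-error cost with its 1/2 factor and the linear hypothesis h_θ(x) = Σ θ_k x_k with x_0 = 1; the summation index k is introduced only for this sketch.

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2 \\
  &= 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right) \\
  &= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{n} \theta_k x_k - y \right) \\
  &= \left( h_\theta(x) - y \right) x_j
\end{aligned}
```

For j = 0 the trailing factor is x_0 = 1, which is why the θ0 update has no extra x term.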

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.

So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.

Sunday, January 20, 2019

[ML] Week 1 (2)

Summary

Subtopics

Model Representation

To establish notation for future use, we’ll use x^{(i)} to denote the “input” variables (living area in this example), also called input features, and y^{(i)} to denote the “output” or target variable that we are trying to predict (price). A pair (x^{(i)}, y^{(i)}) is called a training example, and the dataset that we’ll be using to learn, a list of m training examples (x^{(i)}, y^{(i)}); i = 1, ..., m, is called a training set. Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = ℝ.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

Cost Function

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's and the actual output y's.

To break it apart, it is (1/2) x̄ where x̄ is the mean of the squares of h_θ(x_i) - y_i, or the difference between the predicted value and the actual value.
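
Written as a formula with the symbols used in this section, the squared-error cost is usually given as below. The symbol m for the number of training examples is taken from the gradient descent notes and is an assumption here.

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2
```

The sum divided by m is the mean of the squared differences, and the extra factor of 1/2 is the halving that the text describes.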

This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved (1/2) as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the 1/2 term. The following image summarizes what the cost function does:

Cost Function - Intuition I

If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by h(x)) which passes through these scattered data points.

Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distances of the scattered points from the line will be the least. Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(θ0, θ1) will be 0. The following example shows the ideal situation where we have a cost function of 0.

When θ1 = 1, we get a slope of 1 which goes through every single data point in our model. Conversely, when θ1 = 0.5, we see the vertical distance from our fit to the data points increase.

This increases our cost function to 0.58. Plotting several other points yields the following graph:

Thus as a goal, we should try to minimize the cost function. In this case, θ1 = 1 is our global minimum.
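
The two cost values can be checked with a few lines of code. The data points (1, 1), (2, 2), (3, 3) are an assumption: the figure is not reproduced in these notes, and these points were picked because they lie on a line of slope 1 and give the 0.58 quoted in the text.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost: half the mean of the squared prediction errors."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

# Assumed training data lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

for theta1 in (0.0, 0.5, 1.0, 1.5):
    print(theta1, round(cost(0.0, theta1, xs, ys), 2))
```

Evaluating the cost at several values of θ1 traces out the bowl-shaped curve whose lowest point is at θ1 = 1.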

Cost Function - Intuition II

A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line. An example of such a graph is the one to the right below.

Taking any color and going along the 'circle', one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for J(θ0, θ1) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when θ0 = 800 and θ1 = -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:

When θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) in the contour plot gets closer to the center, thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

The graph above minimizes the cost function as much as possible and consequently, the results for θ1 and θ0 tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the innermost 'circle'.

Saturday, January 19, 2019

[ML] Week 1 (1)

Summary

What is Machine Learning?

Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition.

Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

In general, any machine learning problem can be assigned to one of two broad classifications:

Supervised learning and Unsupervised learning.

Supervised Learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Example 1:

Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.

We could turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories.

Example 2:

(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture

(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.

Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The "Cocktail Party Algorithm", allows you to find structure in a chaotic environment. (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).