R feature selection example

October 18, 2013 Hong

This is R code for feature selection.
I copied it from somewhere, but I can't remember the source.
To elaborate, the code identifies which features (independent variables) matter most when you build a model.

http://github.com/euriion/code_snippets/blob/master/R/feature_selection.R
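
To make the idea of "which features matter" concrete, here is a minimal Python sketch. It is not the method used in the linked R script. It is just one simple filter approach that ranks features by the absolute Pearson correlation with the target, and the names `pearson` and `rank_features` are made up for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (name, |correlation|) pairs sorted from most to least correlated.

    `features` is a hypothetical dict mapping a feature name to its values.
    """
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    features = {
        "x1": [1, 2, 3, 4],
        "noise": [1, -1, 1, -1],
        "const": [5, 5, 5, 5],
    }
    target = [2, 4, 6, 8]
    for name, score in rank_features(features, target):
        print(name, round(score, 3))
```

A correlation filter like this only sees linear, one-variable-at-a-time relationships, so treat it as a first look rather than a replacement for model-based selection.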

Python, Data Engineering

Python code for running on multiple cores

October 17, 2013 Hong

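Python is often used for the parsing and preprocessing steps of ETL.
Even with big data, it is common to preprocess as much as possible before loading the data into Hadoop, Hive, or an RDBMS such as Oracle.
Of course, for very large volumes it is better to write a Map/Reduce job, but for data that is not that large, using all the resources of a single server shortens the processing time.
Before Hadoop became commonplace, this kind of code was frequently taken further: processes were run on several servers at once (much like MapReduce) and the results were then collected.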

https://gist.github.com/euriion/5719443

You can modify the code to handle more complex work, but if the job gets very complex, it is better to consider a different architecture.
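
For readers who want a self-contained picture of the pattern, here is a minimal sketch using the standard `multiprocessing` module. It is not the code from the gist above. The tab-separated line format and the names `parse_line`, `parse_chunk`, `chunked`, and `parse_parallel` are assumptions made for this example.

```python
import multiprocessing


def parse_line(line):
    """Split one tab-separated line into stripped fields (hypothetical format)."""
    return [field.strip() for field in line.rstrip("\n").split("\t")]


def parse_chunk(lines):
    """Worker entry point: parse a list of lines inside one process."""
    return [parse_line(line) for line in lines]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_parallel(lines, workers=None, chunk_size=1000):
    """Parse `lines` across `workers` processes and return rows in input order."""
    with multiprocessing.Pool(processes=workers) as pool:
        # pool.map keeps the chunk order, so the output order matches the input
        results = pool.map(parse_chunk, list(chunked(lines, chunk_size)))
    return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    sample = ["a\t1\n", "b\t2\n", "c\t3\n"] * 1000
    rows = parse_parallel(sample, workers=multiprocessing.cpu_count())
    print(len(rows), rows[0])
```

Handing each worker a chunk of lines rather than a single line keeps the inter-process overhead small, which usually matters more than the exact number of workers.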

Python, Data Engineering

A simple script to convert CSV format to TSV format

October 17, 2013 Hong

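When Excel saves a file in CSV format, it sometimes escapes text columns.
It does this mainly for values containing commas and double quotation marks, and the escaping has to be removed before the file can be uploaded to Hadoop or Hive and used there.
For CSV files that are not large, it is convenient to write a short conversion script in Python and upload the result; this is the source code I used for that.
This approach is also convenient when loading the data into R.
In a broad sense, even a simple task like this counts as data munging.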

https://gist.github.com/euriion/5720809
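
As a rough sketch of the conversion (not the gist's code), the standard `csv` module can read the Excel-style quoting and the rows can then be written out tab-separated with no quoting at all. Replacing tabs and line breaks inside a field with spaces is my own assumption about what a plain TSV consumer needs. The names `clean_field` and `csv_to_tsv` are made up for this example.

```python
import csv
import io


def clean_field(value):
    """Replace characters that would break a plain TSV row with spaces."""
    return value.replace("\t", " ").replace("\r", " ").replace("\n", " ")


def csv_to_tsv(src, dst):
    """Copy rows from a CSV file object to a TSV file object without quoting."""
    reader = csv.reader(src)  # understands "..." wrapping and doubled "" quotes
    for row in reader:
        dst.write("\t".join(clean_field(value) for value in row) + "\n")


if __name__ == "__main__":
    source = io.StringIO('id,comment\r\n1,"Hello, ""world"""\r\n')
    target = io.StringIO()
    csv_to_tsv(source, target)
    print(target.getvalue(), end="")
```

With real files, open the input with `newline=""` as the `csv` documentation recommends, so that line breaks inside quoted fields are passed to the reader unchanged.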

R, Statistics

R ARIMA example code

October 3, 2013 Hong

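This is an example of ARIMA models in R.
The code looks at the trend in a server's memory usage and predicts how long it will be until memory runs out.
Of course, the predictive power is quite low and the result may not be realistic.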

# -------------------------
# Memory usage forecasting
# -------------------------
library(stats)
arima(lh, order = c(1,0,0))
arima(lh, order = c(3,0,0))
arima(lh, order = c(1,0,1))

arima(lh, order = c(3,0,0), method = "CSS")

arima(USAccDeaths, order = c(0,1,1), seasonal = list(order=c(0,1,1)))
arima(USAccDeaths, order = c(0,1,1), seasonal = list(order=c(0,1,1)),
method = "CSS") # drops first 13 observations.
# for a model with as few years as this, we want full ML

arima(LakeHuron, order = c(2,0,0), xreg = time(LakeHuron)-1920)

## presidents contains NAs
## graphs in example(acf) suggest order 1 or 3
require(graphics)
(fit1 <- arima(presidents, c(1, 0, 0)))
tsdiag(fit1)
(fit3 <- arima(presidents, c(3, 0, 0))) # smaller AIC
tsdiag(fit3)

# ----- prediction part

od <- options(digits=5) # avoid too much spurious accuracy
predict(arima(lh, order = c(3,0,0)), n.ahead = 12)

(fit <- arima(USAccDeaths, order = c(0,1,1),
              seasonal = list(order=c(0,1,1))))
predict(fit, n.ahead = 6)
options(od)

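# ----- Arima
library(forecast)
fit <- Arima(WWWusage, order = c(3,1,0))
plot(forecast(fit))

# fracdiff.sim() comes from the fracdiff package; call it explicitly because
# library(forecast) does not attach fracdiff in current versions
x <- fracdiff::fracdiff.sim(100, ma = -.4, d = .3)$series
fit <- arfima(x)
plot(forecast(fit, h = 30))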

# ----- Arima forecast for memory usage (unit %) -----
library(forecast) # requires the "forecast" package to be installed
memory.usage.threshold <- 100 # 100%
memory.usage.forecast.period <- 30 # forecast up to 30 days ahead
memory.usage.observations.startdate <- "2012-09-01"
memory.usage.observations <- c(10,11,30,35,36,39,48,56,75,69,68,72,71,72,83) # 15 days of observations

memory.usage.period <- seq(as.Date(memory.usage.observations.startdate), length.out=length(memory.usage.observations), by="1 day") # set up the dates
memory.usage.df <- data.frame(row.names=memory.usage.period, memory=memory.usage.observations) # convert to a data.frame
memory.usage.ts <- ts(data=memory.usage.df) # create the time series
memory.usage.model <- auto.arima(memory.usage.ts) # fit the ARIMA model
memory.usage.forecast <- forecast(memory.usage.model, h=memory.usage.forecast.period) # generate the forecast
memory.usage.forecast.df <- as.data.frame(memory.usage.forecast) # convert the forecast to a data.frame

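# time index of the first forecast at or above the threshold (NA if there is none)
d <- as.numeric(rownames(memory.usage.forecast.df[memory.usage.forecast.df$`Point Forecast` >= memory.usage.threshold,][1,]))
if(is.na(d)) {
  print(sprintf("Will not exceed %s%% within the next %s days", memory.usage.threshold, memory.usage.forecast.period))
} else {
  print(sprintf("Will exceed %s%% in %s days", memory.usage.threshold, d - length(memory.usage.observations)))
}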

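# ---- Plotting
plot(memory.usage.forecast) # plot the observations and the forecast
abline(h=memory.usage.threshold, col = "red", lty=3) # threshold line
abline(v=d, col = "red", lty=3) # day on which the threshold is first reached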

library(ggplot2)
library(scales)

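# pd is not defined in this snippet; it is assumed to be a data frame with
# the columns date, observed, fitted, forecast, lo95 and hi95
p1a <- ggplot(data=pd, aes(x=date, y=observed))
p1a <- p1a + geom_line(col='red')
p1a <- p1a + geom_line(aes(y=fitted), col='blue')
p1a <- p1a + geom_line(aes(y=forecast)) + geom_ribbon(aes(ymin=lo95, ymax=hi95), alpha=.25)
p1a <- p1a + scale_x_date(name='', date_breaks='1 year', date_minor_breaks='1 month', labels=date_format("%b-%y"), expand=c(0,0))
p1a <- p1a + scale_y_continuous(name='Units of Y')
# opts() and theme_text() were removed from ggplot2; use ggtitle(), theme() and element_text()
p1a <- p1a + ggtitle('Arima Fit to Simulated Data\n(black=forecast, blue=fitted, red=data, shadow=95% conf. interval)') + theme(axis.text.x=element_text(size=10))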

The original source code is available below.

https://github.com/euriion/code_snippets/blob/master/R/forecast_exam.R
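
The "how many days until the threshold is crossed" step in the R code above can also be tried without any packages. The Python sketch below deliberately swaps ARIMA for a plain least-squares straight line, so it only illustrates the threshold search, not the model. The names `fit_line` and `days_until_threshold` are made up for this example.

```python
def fit_line(values):
    """Least-squares slope and intercept for values observed on days 1..n (n >= 2)."""
    n = len(values)
    days = range(1, n + 1)
    mean_x = (n + 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(values, threshold=100.0, horizon=30):
    """Days after the last observation until the fitted line reaches `threshold`.

    Returns None if the line stays below the threshold for the whole horizon.
    """
    slope, intercept = fit_line(values)
    n = len(values)
    for ahead in range(1, horizon + 1):
        if intercept + slope * (n + ahead) >= threshold:
            return ahead
    return None


if __name__ == "__main__":
    observations = [10, 11, 30, 35, 36, 39, 48, 56, 75, 69, 68, 72, 71, 72, 83]
    print(days_until_threshold(observations, threshold=100, horizon=30))
```

A straight line ignores the autocorrelation that ARIMA models, so its answer can differ noticeably from the `auto.arima` forecast on the same data.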

Total Data Science

Monthly archives: October 2013
