Hong의 모든 글

R 버전 확인하는 법

2014년 03월 18일 Hong 댓글 한 개

R의 버전을 확인하는 방법 몇가지입니다.

R의 버전을 확인해야 할 이유는 거의 없습니다.

패키지를 제작할 때 사용자가 여러 버전 R을 사용할 수 있는 점을 고려해 코드가 R버전에 따라 작동을 달리 달 수 있는 부분 때문에 체크를 해야 할 일이 있는 경우 정도입니다.

R의 대부분 패키지들이 버전의 특성을 많이 타거나 민감하지 않기 때문에 R의 버전을 확인하거나 해서 패키지를 골라 설치하거나 할 필요가 없기 때문입니다.

단 R의 기능 변경으로 인해 R의 버전을 확인해보고 싶을 때가 있을 것입니다. 또는 패키지를 제작하신다면 확인이 필요하기도 하지요.

R버전 확인법입니다.

실행 후 환영 메세지 확인하기

R의 버전을 확인하는 가장 쉬운 방법은 R을 실행하면 됩니다. 다음과 같은 화면이 나오면서 출력됩니다. CLI 모드에서도 동일합니다.

R version 3.4.3 (2017-11-30) -- "Kite-Eating Tree"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

sessionInfo() 함수를 이용하는 방법

sessionInfo() 함수를 실행하면 현재 R의 세션에 대한 정보를 보여주는데 R 버전도 함께 보여줍니다.

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949  LC_CTYPE=Korean_Korea.949    LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C                 LC_TIME=Korean_Korea.949    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    yaml_2.1.16

package_version() 함수를 사용

다음과 같이 하면 간단하게 버전만 알아낼 수 있습니다. 패키지를 작성하거나 버전 호환성을 확인할 때 흔히 쓰는 방법입니다.

> package_version(R.version)
[1] ‘3.4.2’

단순히 현재 설치된 R의 버전을 알고 싶다면 위의 코드에서 사용한 R.version을 그대로 출력해서 봐도 됩니다.

R.version을 사용

앞서 말한 R.version을 그대로 화면에 출력하는 방법입니다.

> R.version
               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          4.2                         
year           2017                        
month          09                          
day            28                          
svn rev        73368                       
language       R                           
version.string R version 3.4.2 (2017-09-28)
nickname       Short Summer

이게 가장 쉬운 방법입니다.

R, 인공지능, 기계학습 ML/AI

R feature selection 예제

2013년 10월 18일 Hong 댓글 남기기

R로 된 feature selection 하는 코드입니다.
어디선가 퍼왔는데 출처가 기억이 나질 않네요.
부연설명을 드리면 모델링을 할 때 feature(또는 독립변수)중 어떤 것이 중요한 것인지 판별하는 코드입니다.

http://github.com/euriion/code_snippets/blob/master/R/feature_selection.R

Python, 데이터엔지니어링 Data Engineering

Python multi core 구동 코드

2013년 10월 17일 Hong 댓글 남기기

Python을 이용해서 ETL의 일부인 파싱이나 전처리 작업을 수행하는 경우가 많습니다.
빅데이터인 경우에도 데이터를 Hadoop이나 Hive 또는 Oracle과 같은 RDBMS에 로딩하기 전에 할 수 있는 것들은 최대한 전처리를 한 후에 사용하는 경우가 많이 있습니다.
물론 데이터량이 아주 많으면 Map/Reduce를 작성하는 것이 더 낫습니다만 그리 크지 않은 데이터는 한 대의 서버에서 자원을 풀가동해서 처리해 버리는 것이 작업속도를 줄일 수 있습니다.
Hadoop이 일반화되기 이전에는 이런 형태의 코드를 더 구체화해서 여러 대의 서버에서 동시에 구동되도록 (마치 맵리듀스처럼) 프로세스를 돌리고 결과를 취합하는 것을 만드는 것이 빈번했었습니다.

https://gist.github.com/euriion/5719443

코드를 수정하면 더 복잡한 것도 할 수 있습니다만 매우 복잡하다면 다른 구조를 생각해 보는 것이 좋습니다.

Python, 데이터엔지니어링 Data Engineering

CSV포맷을 TSV포맷으로 바꾸는 간단한 스크립트

2013년 10월 17일 Hong 댓글 남기기

엑셀(Excel)에서 CSV 포맷으로 파일을 저장할 때 텍스트 컬럼을 Escaping처리하는 경우가 있습니다.
주로 쉼표(comma)와 따옴표(double quotation)을 그렇게 변환해 버리는데 Hadoop이나 이 포팻을 Hive에 업로드해서 사용하려면 Escaping을 빼야 합니다.
크기가 크지 않은 CSV는 간단하게 Python으로 변환코드를 작성해서 올려서 사용하는 것이 편한데 그럴때 사용했던 소스코드입니다.
R에서 데이터를 로딩할 때도 이 방법이 편합니다.
이런 간단한 작업도 넓은 의미에서는 데이터 먼징 (Data Munging) 포함됩니다.

https://gist.github.com/euriion/5720809

R, 통계

R ARIMA 예제 코드

2013년 10월 03일 Hong 댓글 남기기

R의 ARIMA 모형의 예제입니다.
서버의 메모리의 사용량의 추이를 보고 얼마 후에 고갈되는지를 예측하는 코드입니다.
물론 예측력은 많이 떨어지고 현실성이 없을 수 있습니다.

# -------------------------
# Memory usage forecasting
# -------------------------
library(stats)
arima(lh, order = c(1,0,0))
arima(lh, order = c(3,0,0))
arima(lh, order = c(1,0,1))

arima(lh, order = c(3,0,0), method = "CSS")

arima(USAccDeaths, order = c(0,1,1), seasonal = list(order=c(0,1,1)))
arima(USAccDeaths, order = c(0,1,1), seasonal = list(order=c(0,1,1)),
method = "CSS") # drops first 13 observations.
# for a model with as few years as this, we want full ML

arima(LakeHuron, order = c(2,0,0), xreg = time(LakeHuron)-1920)

## presidents contains NAs
## graphs in example(acf) suggest order 1 or 3
require(graphics)
(fit1 <- arima(presidents, c(1, 0, 0)))
tsdiag(fit1)
(fit3 <- arima(presidents, c(3, 0, 0))) # smaller AIC
tsdiag(fit3)

# ----- prediction part

od <- options(digits=5) # avoid too much spurious accuracy
predict(arima(lh, order = c(3,0,0)), n.ahead = 12)

(fit <- arima(USAccDeaths, order = c(0,1,1),
seasonal = list(order=c(0,1,1))))
predict(fit, n.ahead = 6)
options(od)

# ----- Arima
library(forecast)
fit <- Arima(WWWusage,c(3,1,0))
plot(forecast(fit))

x <- fracdiff.sim( 100, ma = -.4, d = .3)$series
fit <- arfima(x)
plot(forecast(fit,h=30))

# ----- Arima forecast for memory usage (unit %) -----
library(forecast) # need to install the package "forecast"
memory.usage.threshold <- 100 # 100%
memory.usage.forecast.period <- 30 # 미래 30일분까지 예측
memory.usage.observations.startdate <- "2012-09-01"
memory.usage.observations <- c(10,11,30,35,36,39,48,56,75,69,68,72,71,72,83) # 관측치 12일분

memory.usage.period <- seq(as.Date(memory.usage.observations.startdate), length=length(memory.usage.observations), by="1 day") # 날짜세팅
memory.usage.df <- data.frame(row.names=memory.usage.period, memory=memory.usage.observations) # data.frame으로 변환
memory.usage.ts <- ts(data=memory.usage.df) # time series 생성
memory.usage.model <- auto.arima(memory.usage.ts) # arima 모델 생성
memory.usage.forecast <- forecast(memory.usage.model, h=memory.usage.forecast.period) # forecast 결과 생성
memory.usage.forecast.df <- as.data.frame(memory.usage.forecast) # forecast 결과 변환

d = memory.usage.threshold,][1,])) # 100 이 넘는 최초 데이터 추출
if(is.na(d)) {
print(sprintf("앞으로 %s일동안 %s%% 초과하지 않음", memory.usage.forecast.period, d - length(memory.usage.observations)))
} else {
print(sprintf("%s일 후에 %s%% 초과됨", d - length(memory.usage.observations), memory.usage.threshold))
}

# ---- 시각화(Plotting)
plot(memory.usage.forecast) # plotting
abline(h=100, col = "red", lty=3)
abline(v=d, col = "red", lty=3)

library(ggplot2)
library(scales)

plt <- ggplot(data=pd,aes(x=date,y=observed))
p1a<-p1a+geom_line(col='red')
p1a<-p1a+geom_line(aes(y=fitted),col='blue')
p1a<-p1a+geom_line(aes(y=forecast))+geom_ribbon(aes(ymin=lo95,ymax=hi95),alpha=.25)
p1a<-p1a+scale_x_date(name='',breaks='1 year',minor_breaks='1 month',labels=date_format("%b-%y"),expand=c(0,0))
p1a<-p1a+scale_y_continuous(name='Units of Y')
p1a<-p1a+opts(axis.text.x=theme_text(size=10),title='Arima Fit to Simulated Datan (black=forecast, blue=fitted, red=data, shadow=95% conf. interval)')

원본 소스코드는 아래에 있습니다.

https://github.com/euriion/code_snippets/blob/master/R/forecast_exam.R