Intermediary Platform Service : Hive

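Hive Overview
Building the Installation Files from Source
Installing Hive on CentOS

Prerequisites
Installation

Hive Architecture
HiveQL

Hive CLI Basics
Data
HiveQL DDL (Data Definition Language)
HiveQL DML (Data Manipulation Language)
Functions

Hive Manual
Hive Developer Manual

Data Input and Output
User-Defined Functions
Streaming
User-Defined Hooks
User-Defined Index Handlers
Thrift Client

References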

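This page summarizes Hive, which provides a declarative query language that runs on MapReduce.

Homepage : http://hive.apache.org/
HiveQL
Download : http://hive.apache.org/releases.html, http://www.apache.org/dyn/closer.cgi/hive/
License : Apache 2.0
Platform : Java

Hive Overview

A data warehousing solution built on Hadoop
Developed at Facebook and released as open source
Queried with HiveQL

Building the Installation Files from Source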

 svn co http://svn.apache.org/repos/asf/hive/trunk hive-trunk
 cd hive-trunk
 ant package
 ls -alF build/dist/

Installing Hive on CentOS

Prerequisites

Pig 0.11.1
Hadoop 1.1.2
Java 1.7.0_19
CentOS 6.4, 64-bit
Create a hive database in MySQL, then create the Hive metastore tables in it.

 mysql -uhive -p hive
     source /appl/hive/src/metastore/scripts/upgrade/mysql/hive-schema-0.10.0.mysql.sql
     show tables;
     exit;

Installation

Download Hive and extract it into the /appl/hive folder.

 wget http://apache.mirror.cdnetworks.com/hive/hive-0.12.0/hive-0.12.0.tar.gz
 tar zxvf hive-0.12.0.tar.gz
 chown -R root:root hive-0.12.0
 mv hive-0.12.0 /appl/hive
 
 //--- Copy the JDBC driver
 cp /cloudnas/install/mysql-connector-java-5.1.25-bin.jar /appl/hive/lib

vi .bashrc

 export HIVE_HOME=/appl/hive
 export PATH=$PATH:$HIVE_HOME/bin

Create the HDFS directories used by Hive

 hadoop dfs -mkdir /tmp
 hadoop dfs -mkdir /user/hive/warehouse
 hadoop dfs -chmod g+w /tmp
 hadoop dfs -chmod g+w /user/hive/warehouse

vi /appl/hive/conf/hive-site.xml
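- fs.default.name : NameNode connection information

 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://cloud001.cloudserver.com:9000</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://localhost:3306/hive?useUnicode=true&amp;characterEncoding=UTF-8</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.gjt.mm.mysql.Driver</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>hive</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>???</value>
   </property>
   <property>
     <name>datanucleus.autoCreateSchema</name>
     <value>false</value>
   </property>
   <property>
     <name>datanucleus.fixedDatastore</name>
     <value>true</value>
   </property>
 </configuration>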

Verifying the service

 start-all.sh                 //--- Hadoop must already be running.
 hive
     show tables;
     exit;
 hive --help

References
- Hive 설치 및 환경구축하기 (Installing Hive and Setting Up the Environment), 2013.1

Hive Architecture

HiveQL

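Hive CLI Basics

Starting the Hive CLI

 hive                        //--- hive --service cli

Configuration file that is executed first when the hive command starts

 ~/.hiverc

Variable namespaces
- hivevar (default, prefix may be omitted), hiveconf, system, env (read-only)

 set;                        //--- Show all variables
 set env:HIVE_HOME;          //--- Show the HIVE_HOME environment variable
 set hivevar:foo=~;          //--- Assign a value to a variable
 ${variable_name}            //--- How to reference a variable on the command line

HiveQL script files : ~.hql, ~.q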

 hive -f ~.hql
 hive 
     source ~.hql;
     exit;

Data

Data types available in Hive tables
- string
- tinyint, smallint, int, bigint
- float, double
- boolean, timestamp, binary
- struct, map, array : array<string>, map<string, int>, struct<~:string, ~:int>
  - item = struct('~~', '~~'); //--- item.name
  - item = map('name1', 'value1', 'name2', 'value2'); //--- item['name1']
  - item = array('val1', 'val2'); //--- item[0]; indexes are 0, 1, 2, ...
Default delimiters in text data
- \n : separates records
- ^A : separates fields (\001), Ctrl+A
- ^B : separates the elements of a struct, map, or array (\002), Ctrl+B
- ^C : separates a key from its value in a map (\003), Ctrl+C
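
The delimiter list above can be illustrated with a small parser. This is only a sketch of how one text record breaks down under the default delimiters, not how Hive's own SerDe is implemented; the class name, method names, and the sample record are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DefaultDelimiterDemo {
    // Default text delimiters listed above.
    static final String FIELD = "\u0001";      // ^A
    static final String ITEM = "\u0002";       // ^B
    static final String MAP_KEY = "\u0003";    // ^C

    // Split one record (a line without its trailing \n) into raw fields.
    static List<String> splitFields(String record) {
        return Arrays.asList(record.split(FIELD, -1));
    }

    // Split an array-typed field into its elements.
    static List<String> parseArray(String field) {
        return Arrays.asList(field.split(ITEM, -1));
    }

    // Split a map-typed field into key/value pairs, keeping the file order.
    static Map<String, String> parseMap(String field) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : field.split(ITEM, -1)) {
            String[] kv = entry.split(MAP_KEY, 2);
            result.put(kv[0], kv.length > 1 ? kv[1] : null);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical record: string name, array<string> tags, map<string, int> scores.
        String record = "kim" + FIELD + "a" + ITEM + "b" + FIELD
                + "math" + MAP_KEY + "90" + ITEM + "art" + MAP_KEY + "75";
        List<String> fields = splitFields(record);
        System.out.println(fields.get(0));
        System.out.println(parseArray(fields.get(1)));
        System.out.println(parseMap(fields.get(2)));
    }
}
```

All values stay strings here; converting them to int or other column types is left out to keep the sketch short.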

HiveQL DDL (Data Definition Language)

Managing databases
- default : name of the database that Hive provides by default
  The storage location is set by hive.metastore.warehouse.dir in hive-default.xml
- Default hive.metastore.warehouse.dir : /user/hive/warehouse

 use ~;                                 //--- Select the database to use
 create database ~                      //--- Creates the /user/hive/warehouse/~.db folder unless location is given
        comment '~'
        location '/user/test/warehouse'
        with dbproperties ('name1' = 'value1', 'name2' = 'value2');
 show databases [like '~*'];
 describe database ~;
 set hive.cli.print.current.db=true;    //--- Show the current database in the prompt
 drop database [if exists] ~ [cascade];   //--- cascade : also drops every table in the database

Managing managed tables
- Fully qualified table name : dbName.tableName; the data is stored in /user/hive/warehouse/dbName.db/tableName

 create table [if not exists] [~.]~ (
     ~ string [comment '~'],
     ~ int
     )
     comment '~'
     row format delimited
         fields terminated by '\001'
         collection items terminated by '\002'
         map keys terminated by '\003'
         lines terminated by '\n'
     stored as textfile
     location '/user/hive/warehouse/~.db/~'
     tblproperties ('name1' = 'value1', 'name2' = 'value2');
 create table ~ like ~;                    //--- Create a table by copying the schema of another table
 create table ~ 
     as select ~
          from ~
         group by ~
         order by ~;
 show tables [in ~] ['~'];
 describe [formatted | extended] ~;
 drop table [if exists] ~;

Managing external tables
- Dropping an external table does not delete its data.

 create external table ~ (
     ~
     )
     location '/data/aaa';

Managing partitioned tables
- Data is stored in /user/hive/warehouse/~~.db/~~/p1=~~/p2=~~
- p1 and p2 can be used like ordinary columns
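
The directory layout above can be sketched as a small path builder. This is a minimal illustration that assumes a managed table in a non-default database under the default warehouse root; the class name, method name, and sample values are hypothetical, and any escaping of special characters in partition values is ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathDemo {
    // Build the HDFS directory of one partition, following the layout
    // /user/hive/warehouse/<db>.db/<table>/p1=<v1>/p2=<v2>.
    // The partition spec must be ordered as in "partitioned by (...)".
    static String partitionDir(String db, String table, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder("/user/hive/warehouse/")
                .append(db).append(".db/").append(table);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("p1", "2013");
        spec.put("p2", "kr");
        System.out.println(partitionDir("testdb", "logs", spec));
    }
}
```

Because each partition is its own directory, a where clause on p1 or p2 lets Hive read only the matching directories.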

 //--- strict : a query on a partitioned table must filter on a partition column in its where clause
 //--- nonstrict : no partition filter is required
 set hive.mapred.mode=strict;
 create table ~ (
     ~
     )
     partitioned by (p1 string, p2 string);
 show partitions ~ [partition(p1='~')];
 
 alter table ~ 
       add [if not exists] partition(~=~)    //--- Add a partition to the table and link it to the data location
       location '~';
 alter table ~
       drop [if exists] partition(~);

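HiveQL DML (Data Manipulation Language)

Loading and exporting data
- local : if specified, the data is copied from the local file system; if omitted, the data is moved within HDFS
- overwrite : if specified, every file in the target folder is deleted before the data is added; if omitted, the data is appended

 load data [local] inpath '~'         //--- inpath usually names a folder; a single file also works
      [overwrite] into table ~
      partition (~=~, ~=~);
 
 //--- Save the data of a Hive table to external files
 insert overwrite [local] directory '~'    //--- existing files in the directory are replaced
        select ~;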

Table CRUD : insert

 //--- Using overwrite instead of into deletes the existing data before adding the new data.
 insert into table ~
        partition (~=~, ~=~)
        select * from ~ where ~;
 
 //--- Dynamic partitions
 //--- hive.exec.dynamic.partition = false              //--- true : enable dynamic partitioning
 //--- hive.exec.dynamic.partition.mode = strict        //--- nonstrict : every partition column may be assigned dynamically
 //--- hive.exec.max.dynamic.partitions.pernode = 100   //--- Maximum number of dynamic partitions per node
 //--- hive.exec.max.dynamic.partitions = 1000          //--- Maximum number of dynamic partitions one insert statement may create
 //--- hive.exec.max.created.files = 100000             //--- Maximum number of files one query may create
 insert into table ~              //--- The last columns of the select list are mapped to the dynamic partition columns
        partition (~=~, ~)
        select ~, ~, ~, ~ from ~;
 //--- Static partitions
 from ~                           //--- Reads the data once and applies multiple insert statements
      insert into table ~
             partition (~=~)
             select * where ~
      insert into table ~
             partition (~=~)
             select * where ~;

Table CRUD : select
- The where clause is evaluated after the joins are applied

 select ~ as ~,                    //--- `aa.*` : selects every column whose name starts with aa
        case
            when condition then '~'
            else '~'
        end as ~
   from ~ as ~
        join ~ on ~ = ~           //--- Place the largest table last
        left outer join ~ on ~    //--- Returns the rows of the left table; null where the right side has no match
        left semi join ~ on ~     //--- Returns the rows of the left table that satisfy the condition
        right outer join ~ on ~   //--- Returns the rows of the right table; null where the left side has no match
        full outer join ~ on ~
  where ~ and ~ like '%aa'        //--- str rlike ~, str regexp ~ : true if str matches the regular expression (~)
  group by ~
  having ~                        //--- Filters on the results produced by group by
  order by ~                      //--- Sorts the entire result
  distribute by ~                 //--- Complements sort by : rows with the same ~ go to the same reducer
  sort by ~                       //--- Sorts only within each node (reducer)
  cluster by ~                    //--- Combines distribute by and sort by
  limit ~;
 
 from ~
      select ~
       where ~;
 
 select *                         //--- Sampling; m : total number of buckets, n : bucket number to fetch (1, 2, ...)
   from ~ tablesample(bucket n out of m on rand()) newName; 
 select *                         //--- hive.sample.seednumber = 7383
   from ~ tablesample(0.1 percent) newName;     //--- Sampling that uses the seed number
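
The difference between order by and the distribute by / sort by pair in the select template can be made concrete with a small simulation. This is only a sketch of the idea: the hash used to pick a reducer is a stand-in, not Hive's actual partitioner, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DistributeSortDemo {
    // distribute by row[0] : every row with the same key goes to the same reducer.
    // sort by row[1]       : rows are sorted inside each reducer only, so the
    //                        combined output is not globally ordered.
    static List<List<String[]>> distributeAndSort(List<String[]> rows, int reducers) {
        List<List<String[]>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String[] row : rows) {
            int r = Math.floorMod(row[0].hashCode(), reducers);   // stand-in partitioner
            buckets.get(r).add(row);
        }
        for (List<String[]> bucket : buckets) {
            bucket.sort(Comparator.comparing((String[] row) -> row[1]));
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"a", "3"});
        rows.add(new String[] {"b", "1"});
        rows.add(new String[] {"a", "1"});
        rows.add(new String[] {"a", "2"});
        List<List<String[]>> out = distributeAndSort(rows, 2);
        for (int i = 0; i < out.size(); i++) {
            for (String[] row : out.get(i)) {
                System.out.println("reducer " + i + ": " + row[0] + "\t" + row[1]);
            }
        }
    }
}
```

When the distribute by and sort by columns are the same, cluster by expresses both steps in a single clause.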

union all : combines the results of two or more select statements into one result.

 select ~ from ~
 union all
 select ~ from ~;

View

 create view [if not exists] ~ [(~, ~, ~)]
        comment '~'
        tblproperties (~)
        as select ~ from ~;
 drop view [if exists] ~;

Functions

Types of functions
- UDF : User-Defined Function
- UDAF : User-Defined Aggregate Function
- UDTF : User-Defined Table-generating Function
Function information

 show functions;
 describe function [extended] ~;

Statistical functions

 bigint count([distinct] ~)
 double sum(~), avg(~), min(~), max(~)
 double var_pop(~), var_samp(~)             //--- Population variance / sample variance
 double stddev_pop(~), stddev_samp(~)       //--- Population standard deviation / sample standard deviation
 double covar_pop(~), covar_samp(~)         //--- Population covariance / sample covariance
 double corr(~, ~)                          //--- Correlation
 double percentile(~, p), percentile_approx(~, p, NB)      //--- Percentile; p is a double in 0 ~ 1, NB = 10000
 array<double> percentile(~, [p1, ...]), percentile_approx(~, [p1, ...], NB)   //--- Percentiles
 array<struct> histogram_numeric(~, NB)   //--- Array of NB histogram bins; x : bin center, y : bin height
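
The _pop and _samp variants above differ only in the divisor. The following sketch shows the textbook definitions (divide by n for the population form, by n - 1 for the sample form); it illustrates the formulas rather than Hive's internal implementation, and the class name and sample data are hypothetical.

```java
public class VarianceDemo {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sum of squared deviations from the mean.
    static double squaredDeviations(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) {
            ss += (x - m) * (x - m);
        }
        return ss;
    }

    // Population variance : divide by n (the var_pop definition).
    static double varPop(double[] xs) {
        return squaredDeviations(xs) / xs.length;
    }

    // Sample variance : divide by n - 1 (the var_samp definition).
    static double varSamp(double[] xs) {
        return squaredDeviations(xs) / (xs.length - 1);
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("var_pop     = " + varPop(data));
        System.out.println("var_samp    = " + varSamp(data));
        System.out.println("stddev_pop  = " + Math.sqrt(varPop(data)));
        System.out.println("stddev_samp = " + Math.sqrt(varSamp(data)));
    }
}
```

The standard deviation functions are the square roots of the corresponding variances.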

Table-generating functions

 records explode(array), explode(map)       //--- Turns an array or a map into rows
 records stack(n, col1, ... colk)           //--- Splits col1 ... colk into n rows
 tuple json_tuple(jsonStr, p1, ..., pn)
 //--- partName : host, path, query, ref, protocol, authority, file, userinfo, query:keyName
 tuple parse_url_tuple(url, partName1, .., partNamen)
    select parse_url_tuple(url, 'HOST', 'PATH') as (host, path) from ~;

Conversion functions

 cast (~ as float)                          //--- Converts ~ to the float type
 string regexp_replace(str, regex, replace), regexp_extract(str, regex, index)
 
 //--- Date functions
 string from_unixtime(int), to_date(string)
 int year(str), month(str), day(str)

Hive Manual

Hive help

 hive --help
 hive --service cli --help

Hive Service
- beeline
- cli : default, Command line interface
- help
- hiveserver : Thrift server
- hiveserver2
- hwi : Hive Web Interface
- jar : runs an application in the Hive environment
- lineage
- metastore : runs the metastore as a service outside Hive so that multiple clients are supported
- metatool
- orcfiledump
- rcfilecat : prints the contents of an RCFile
Using the Hive CLI

 ! Linux_shell_command;
 dfs -help;
 dfs -ls /;                         //--- Run an HDFS command
 set hive.cli.print.header=true;    //--- Show the table header

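Configuring the hwi service
- vi /appl/hive/conf/hive-site.xml

   <property>
     <name>hive.hwi.listen.host</name>
     <value>0.0.0.0</value>
   </property>
   <property>
     <name>hive.hwi.listen.port</name>
     <value>9999</value>
   </property>
   <property>
     <name>hive.hwi.war.file</name>
     <value>/lib/hive-hwi-0.11.0.war</value>
   </property>

Starting hwi

 hive --service hwi

Check the service at http://localhost:9999/hwi/
Starting the Thrift server

 hive --service hiveserver &
 netstat -an | grep LISTEN | grep tcp      //--- Check the port in use; port 10000 is used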

Configuring Hive locking with ZooKeeper
- vi /appl/hive/conf/hive-site.xml

   <property>
     <name>hive.zookeeper.quorum</name>
     <value>cloud001.cloudserver.com</value>   <!-- Separate multiple ZooKeeper hosts with "," -->
   </property>
   <property>
     <name>hive.support.concurrency</name>
     <value>true</value>
   </property>

Using locks in Hive

 show locks [extended];
 lock table ~ exclusive;              //--- Set an exclusive lock on the table
 unlock table ~;

Hive Developer Manual

Data Input and Output

Textfile format

 create table ~
 stored as textfile;

Sequencefile format (a file made of key/value pairs; convenient when compression is used)

 create table ~
 stored as sequencefile;

RCFile format (provides access by row and by column)

 create table ~ (
     ~
     )
     row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
         with serdeproperties ('~'='~')  
     stored as
         inputformat  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
         outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';

Viewing an RCFile

 hive --service rcfilecat /user/hive/warehouse/~/~

Storage handlers

 create table ~ (                //--- Create a Hive table
     key int, name string, price float
     )
     stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
     with serdeproperties("hbase.columns.mapping" = ":key,stock:val")
     tblproperties ("hbase.table.name" = "~");
 
 create external table ~ (       //--- Link to an existing HBase table
     ~
     )
     stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
     with serdeproperties("hbase.columns.mapping" = "cf1:val")
     tblproperties ("hbase.table.name" = "~");


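How Hive processes data
- Input Format Object : splits the input data into records
  - org.apache.hadoop.mapred.TextInputFormat
  - org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
- SerDe : breaks a record into columns, or joins columns into a record (Serializer/Deserializer)
  - org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
- Output Format Object : writes the records
  - org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Specifying the data-processing classes in the table definition

 create table ~ (
     ~
     )
     row format serde '~'              //--- Fully qualified class name of the SerDe
         with serdeproperties ('~'='~')  
     stored as
         inputformat '~'               //--- Fully qualified class name of the Input Format Object
         outputformat '~';             //--- Fully qualified class name of the Output Format Object

User-defined InputFormat

 public class ~ implements InputFormat<K, V> {
     public InputSplit[] getSplits(JobConf jc, int numSplits) throws IOException {
         ~                             //--- Split the input into InputSplit objects
     }
     public RecordReader<K, V> getRecordReader(InputSplit split, JobConf jc, Reporter reporter) throws IOException {
         ~                             //--- Return a reader that turns one split into records
     }
 }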

User-Defined Functions

Writing and deploying a UDF
- Create the UDF

 package ~;
 @Description(name="~", ~)
 
 public class UDF~ extends UDF {
     public String evaluate(~) {
         return ~;
     }
 }
 
 public class UDF~ extends GenericUDF {
     private GenericUDFUtils.ReturnObjectInspectorResolver rtdata;
     private ObjectInspector[] args;
 
     //--- Check the input arguments
     public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
         return (new GenericUDFUtils.ReturnObjectInspectorResolver(true)).get();
     }
 
     //--- Run the function
     public Object evaluate(DeferredObject[] arguments) throws HiveException {
         Object rtVal = null;
 
         return rtVal;
     }
 
     //--- Text shown for debugging
     public String getDisplayString(String[] children) {
         return ~;
     }
 }

Compile the code and build a jar file
Temporary registration in Hive

 hive
     add jar ~.jar;
     create temporary function ~      //--- Function name
            as '~';                   //--- Fully qualified class name

Permanent registration in Hive
vi ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

 registerUDF("~", ~.class, false);
 registerGenericUDF("~", ~.class);
 //--- Rebuild Hive (the hive-exec-*.jar file).

Macros

 create temporary macro ~(~ string) macro_body;    //--- macro_name(argument)

Streaming

Streaming clauses : map(~~), transform(~~), reduce(~~)
Create the test data (tab-separated)
- vi /root/zztemp.txt

 123     24
 124     25

Write the bash script used for streaming
- vi /root/zztemp.bash && chmod 755 /root/zztemp.bash

 #!/bin/bash
 while IFS= read -r LINE
 do
     echo "$LINE"
 done

Register the bash script in Hive and test it
- Instead of zztemp.bash, a Linux program such as /bin/cat can be run directly.

 hive
     create table zztemp (f1 int, f2 int)
            row format delimited fields terminated by '\t';
     load data local inpath 'file:///root/zztemp.txt' into table zztemp;
     add file file:///root/zztemp.bash;        //--- The registered program is removed once the job has finished.
 
     //--- Passes the f1 and f2 columns of the zztemp table to the standard input of zztemp.bash and reads back the result (newF1, newF2).
     select transform(f1, f2)
      using 'zztemp.bash' as (newF1 int, newF2 int)
       from zztemp;
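
The script used with transform only has to read tab-separated rows from standard input and write tab-separated rows to standard output, so it could be written in any language. A minimal Java equivalent of zztemp.bash might look like the sketch below; the class name StreamEcho is hypothetical, and how the compiled class would be shipped to the cluster and invoked from the using clause is not covered here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class StreamEcho {
    // Handle one input row. transform(f1, f2) delivers the selected columns
    // as a single tab-separated line; this sketch passes them through
    // unchanged, which is the place to put real per-row logic.
    static String transformLine(String line) {
        String[] cols = line.split("\t", -1);
        return String.join("\t", cols);
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            // One output line per input line; columns stay tab-separated
            // so that they map onto (newF1, newF2).
            System.out.println(transformLine(line));
        }
    }
}
```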

User-Defined Hooks

Hook
- PreHook
- PostHook

User-Defined Index Handlers

https://cwiki.apache.org/confluence/display/Hive/IndexDev#CREATE_INDEX

Thrift Client

Connecting to Hive through Thrift

 import org.apache.hadoop.hive.service.*;
 import org.apache.thrift.protocol.*;
 import org.apache.thrift.transport.*;
 
 TTransport transport = new TSocket("localhost", 10000);
 TProtocol protocol = new TBinaryProtocol(transport);    //--- The protocol wraps the transport
 HiveClient client = new HiveClient(protocol);
 
 transport.open();
 client.getClusterStatus();
 client.execute("~");
 client.getSchema();
 client.getQueryPlan();
 client.fetchOne();                   //--- Or fetchN(n), fetchAll()
 transport.close();

References

Category: BigData

Last modified: 2024-09-30 12:26:18