
Hive: SerDe

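I. Background

1. When processes communicate remotely, they can send each other data of any type, but whatever the type, it travels over the network as a binary sequence. The sender must convert an object into a byte sequence before it can be transmitted, which is called object serialization. The receiver must restore the byte sequence to an object, which is called object deserialization.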
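The round trip described above can be sketched in plain Java. This is only an illustration and not how Hive itself encodes rows: the AccessRecord class, its fields, and the toBytes/fromBytes helpers are hypothetical names chosen for this example, and the byte layout is simply whatever DataOutputStream produces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type, used only for this illustration.
final class AccessRecord {
    final long time;
    final int userId;
    final String url;

    AccessRecord(long time, int userId, String url) {
        this.time = time;
        this.userId = userId;
        this.url = url;
    }

    // Serialization: object -> byte sequence (what the sender does).
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeLong(time);
            out.writeInt(userId);
            out.writeUTF(url);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Deserialization: byte sequence -> object (what the receiver does).
    // Fields must be read in the same order they were written.
    static AccessRecord fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            long time = in.readLong();
            int userId = in.readInt();
            String url = in.readUTF();
            return new AccessRecord(time, userId, url);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        AccessRecord sent = new AccessRecord(1234567891012L, 123456, "http://wiki.apache.org/hadoop/Hive");
        byte[] wire = sent.toBytes();                       // bytes that would travel over the network
        AccessRecord received = AccessRecord.fromBytes(wire);
        System.out.println(received.time + " " + received.userId + " " + received.url);
    }
}
```

The receiver only gets a meaningful object back because both sides agree on the field order and types, which is the same contract a SerDe establishes between stored bytes and table columns.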

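2. Hive's deserialization converts each key/value record into the values of the columns of a Hive table.

3. Hive can load data into a table without transforming it first, which saves a great deal of time when processing massive data sets.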

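II. Technical details

1. SerDe is short for Serializer/Deserializer. Its purpose is serialization and deserialization.

2. When creating a table, users can specify a custom SerDe or use one of Hive's built-in SerDes. A SerDe can define the table's columns and map the data onto those columns.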

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)]
    INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

To create a table with a specific SerDe, use the ROW FORMAT row_format clause, for example:

a. Add the jar. In the Hive client, enter: hive> add jar /run/serde_test.jar;

Or run this command in a Linux shell: ${HIVE_HOME}/bin/hive --auxpath /run/serde_test.jar

b. Create the table: create table serde_table row format serde 'hive.connect.TestDeserializer';

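3. Write the deserialization class TestDeserializer. It implements the three methods of the Deserializer interface:

a) Initialization: initialize(Configuration conf, Properties tbl).

b) Deserialize a Writable into an Object: deserialize(Writable blob).

c) Get the ObjectInspector for the Object returned by deserialize(Writable blob): getObjectInspector().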

public interface Deserializer {
    public void initialize(Configuration conf, Properties tbl) throws SerDeException;
    public Object deserialize(Writable blob) throws SerDeException;
    public ObjectInspector getObjectInspector() throws SerDeException;
}
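The initialize method receives the table's properties, so one thing an implementation might do there is read the declared column names instead of hard-coding them. The sketch below uses only java.util so it can run without Hive on the classpath. It assumes the properties carry a comma-separated entry under the key "columns" (the key Hive's built-in SerDes read). ColumnNameReader and readColumnNames are hypothetical names, and the fallback to a default list is a design choice made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Hive: shows what initialize(conf, tbl) could do with tbl.
final class ColumnNameReader {
    // Assumed property key holding comma-separated column names.
    static final String COLUMNS_KEY = "columns";

    // Returns the column names found in the table properties,
    // or a copy of the defaults when the property is missing or empty.
    static List<String> readColumnNames(Properties tbl, List<String> defaults) {
        String raw = tbl.getProperty(COLUMNS_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return new ArrayList<>(defaults);
        }
        List<String> names = new ArrayList<>();
        for (String name : raw.split(",")) {
            names.add(name.trim());
        }
        return names;
    }

    public static void main(String[] args) {
        Properties tbl = new Properties();
        tbl.setProperty(COLUMNS_KEY, "time,userid,host,path");
        List<String> defaults = Arrays.asList("time", "userid", "host", "path");
        System.out.println(readColumnNames(tbl, defaults));
    }
}
```

A deserializer that defines its own schema can leave initialize empty and rely on fixed field names, since the column list then comes from its ObjectInspector rather than from the table definition.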

The following deserialization class splits one line of data into the four Hive table fields time, userid, host and path:

package hive.connect;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TestDeserializer implements Deserializer {

    // Column names and their inspectors, kept in table column order.
    private static List<String> FieldNames = new ArrayList<String>();
    private static List<ObjectInspector> FieldNamesObjectInspectors = new ArrayList<ObjectInspector>();

    static {
        FieldNames.add("time");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(Long.class,
                        ObjectInspectorOptions.JAVA));

        FieldNames.add("userid");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(Integer.class,
                        ObjectInspectorOptions.JAVA));

        FieldNames.add("host");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(String.class,
                        ObjectInspectorOptions.JAVA));

        FieldNames.add("path");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(String.class,
                        ObjectInspectorOptions.JAVA));
    }

    // Turns one tab-separated input line (time, userid, url) into a four-column row.
    // Returns null for input that cannot be parsed.
    @Override
    public Object deserialize(Writable blob) {
        try {
            if (blob instanceof Text) {
                String line = ((Text) blob).toString();
                if (line == null)
                    return null;
                String[] field = line.split("\t");
                if (field.length != 3) {
                    return null;
                }
                List<Object> result = new ArrayList<Object>();
                URL url = new URL(field[2]);
                Long time = Long.valueOf(field[0]);
                Integer userid = Integer.valueOf(field[1]);
                result.add(time);
                result.add(userid);
                // The URL is split into two columns: host and path.
                result.add(url.getHost());
                result.add(url.getPath());
                return result;
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (NumberFormatException e) {
            // A non-numeric time or userid is treated as a malformed row.
            e.printStackTrace();
        }
        return null;
    }

    // Describes the row as a struct of the four fields registered above.
    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                FieldNames, FieldNamesObjectInspectors);
    }

    // Nothing to set up: the schema is fixed in the static block.
    @Override
    public void initialize(Configuration arg0, Properties arg1)
            throws SerDeException {
    }
}

Testing with Hive table data on HDFS. Below is one line of test data (the three fields are tab-separated, as the deserializer expects):

1234567891012 123456 http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

hive> add jar /run/jar/merg_hua.jar;
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
Time taken: 0.028 seconds
hive> describe serde_table;
time bigint from deserializer
userid int from deserializer
host string from deserializer
path string from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
1234567891012 123456 wiki.apache.org /hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds
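The row returned by the query can also be checked without a Hive cluster by running the same splitting logic on the test line. The sketch below is a standalone approximation of that logic, not the SerDe itself: LineParseCheck and parse are hypothetical names, and it takes a plain String instead of a Hadoop Text object so that it needs no Hive or Hadoop jars.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone check, mirroring the splitting logic of TestDeserializer.
final class LineParseCheck {

    // Returns [time, userid, host, path], or null when the line is malformed.
    static List<Object> parse(String line) {
        if (line == null) {
            return null;
        }
        String[] field = line.split("\t");
        if (field.length != 3) {
            return null;
        }
        try {
            URL url = new URL(field[2]);
            List<Object> row = new ArrayList<>();
            row.add(Long.valueOf(field[0]));     // time   -> bigint
            row.add(Integer.valueOf(field[1]));  // userid -> int
            row.add(url.getHost());              // host   -> string
            row.add(url.getPath());              // path   -> string
            return row;
        } catch (MalformedURLException | NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String line = "1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF";
        // prints [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
        System.out.println(parse(line));
    }
}
```

Lines with the wrong number of fields, a non-numeric time or userid, or an invalid URL yield null here, which corresponds to the cases where the deserializer gives up on a row.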

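III. Summary

1. To create a Hive table with a custom SerDe, write a class that implements Deserializer and specify it with the ROW FORMAT clause of the CREATE TABLE command.

2. When processing massive data sets, if the data format matches the table structure, Hive's deserialization can read the data directly without converting it first, which saves a great deal of time.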
