Apache CarbonData Milestone Version 1.3 Released | Open Source | 陈亮




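CarbonData is a new high-performance data storage format that has been deployed and used in the production environments of more than 20 enterprises, with enterprise data volumes reaching the trillion scale. Analytics scenarios in the big data field have divergent requirements, which leads to redundant storage, while business needs demand ever more flexible data analysis. To address this, CarbonData provides a unified data storage solution in which a single copy of the data supports multiple application scenarios. Features such as multi-level indexing, dictionary encoding, pre-aggregation, dynamic partitioning, near-real-time query, and columnar storage improve I/O scan and compute performance, delivering second-level responses on data at the ten-billion scale.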

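Let's look at the major features of CarbonData 1.3.0:

1. Integration with Spark 2.2.1

CarbonData 1.3.0 supports integration with Spark 2.2.1, the latest stable Spark release.

2. Pre-aggregate tables

In 1.3.0, the biggest difference between CarbonData's pre-aggregation feature and the CUBE approach of traditional BI systems is that users do not need to change any SQL statements: it accelerates group by statistics while still allowing queries on detail data, so a single copy of the data serves multiple application scenarios. Usage is as follows:

a) Create the main table: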

 CREATE TABLE sales (
order_time TIMESTAMP,
user_id STRING,
sex STRING,
country STRING,
quantity INT,
price BIGINT)
STORED BY 'carbondata'

b) Create a pre-aggregate table on the main table sales:

 CREATE DATAMAP agg_sales
ON TABLE sales
USING "preaggregate"
AS
SELECT country, sex, sum(quantity), avg(price)
FROM sales
GROUP BY country, sex

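c) Users do not need to change their SQL. When a query against the main table sales hits the pre-aggregate table agg_sales, query performance improves significantly:

SELECT country, sex, sum(quantity), avg(price) FROM sales GROUP BY country, sex; -- hit: identical to the aggregate table

SELECT sex, sum(quantity) FROM sales GROUP BY sex; -- hit: a partial query of the aggregate table

SELECT country, avg(price) FROM sales GROUP BY country; -- hit: a partial query of the aggregate table

SELECT country, sum(price) FROM sales GROUP BY country; -- hit: avg(price) in the aggregate table is produced from sum(price)/count(price), so sum(price) also hits

SELECT sex, avg(quantity) FROM sales GROUP BY sex; -- miss: a new pre-aggregate table is needed

SELECT max(price), country FROM sales GROUP BY country; -- miss: a new pre-aggregate table is needed

SELECT user_id, country, sex, sum(quantity), avg(price) FROM sales GROUP BY user_id, country, sex; -- miss: a new pre-aggregate table is needed

d) In version 1.3.0, the supported pre-aggregate expressions are: SUM, AVG, MAX, MIN, COUNT

e) In tests, performance improved by more than 10x. To try it, scale the test data to 100 million rows or more and run this example: /apache/carbondata/examples/PreAggregateTableExample.scala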
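
The hit and miss rules above follow from which values the aggregate table stores. The sketch below is not CarbonData's implementation: it uses Python's built-in sqlite3 module, a tiny made-up data set, and a hand-built table whose column names (sum_quantity, sum_price, count_price) are assumptions chosen for illustration. It shows that a coarser GROUP BY on sum(price) and avg(price) can be answered from the pre-aggregated rows alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (user_id TEXT, sex TEXT, country TEXT, quantity INTEGER, price INTEGER);
INSERT INTO sales VALUES
  ('u1', 'F', 'CN', 2, 10),
  ('u2', 'M', 'CN', 1, 30),
  ('u3', 'F', 'US', 5, 20),
  ('u4', 'F', 'US', 3, 40),
  ('u5', 'M', 'US', 4, 60);
-- Hand-built stand-in for the pre-aggregate table: avg(price) is kept as sum + count.
CREATE TABLE agg_sales AS
  SELECT country, sex,
         sum(quantity) AS sum_quantity,
         sum(price)    AS sum_price,
         count(price)  AS count_price
  FROM sales GROUP BY country, sex;
""")


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# The user's query, evaluated against the detail table.
direct = rows(
    "SELECT country, sum(price), avg(price) FROM sales GROUP BY country ORDER BY country"
)

# The same query answered from the aggregate table only:
# sums are re-summed, and avg is rebuilt as sum / count.
rewritten = rows("""
    SELECT country, sum(sum_price), sum(sum_price) * 1.0 / sum(count_price)
    FROM agg_sales GROUP BY country ORDER BY country
""")

print(direct == rewritten)
```

A query on max(price) or a GROUP BY on user_id cannot be rebuilt this way, because the aggregate table holds neither a max column nor the user_id dimension, which matches the misses listed above.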

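3. Pre-aggregation on the time dimension, with automatic roll-up

This is an alpha feature. The time granularity can currently only be set to 1: for example, aggregation by 1 day is supported, but aggregation at a granularity of 3 or 5 days is not yet supported and is planned for the next release. Automatic roll-up is supported (Year, Month, Day, Hour, Minute). Usage is as follows:

a) Create the main table: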

 CREATE TABLE sales (
order_time TIMESTAMP,
user_id STRING,
sex STRING,
country STRING,
quantity INT,
price BIGINT)
STORED BY 'carbondata'

b) Create aggregate tables at Year, Month, Day, Hour, and Minute granularity:

CREATE DATAMAP agg_year
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
'event_time'='order_time',
'year_granularity'='1'
) AS
SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
 avg(price) FROM sales GROUP BY order_time, country, sex

CREATE DATAMAP agg_month
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
'event_time'='order_time',
'month_granularity'='1'
) AS
SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
 avg(price) FROM sales GROUP BY order_time, country, sex

CREATE DATAMAP agg_day
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
'event_time'='order_time',
'day_granularity'='1' -- the granularity can currently only be set to 1
) AS
SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
 avg(price) FROM sales GROUP BY order_time, country, sex

CREATE DATAMAP agg_sales_hour
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
'event_time'='order_time',
'hour_granularity'='1'
) AS
SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
 avg(price) FROM sales GROUP BY order_time, country, sex

CREATE DATAMAP agg_minute
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
'event_time'='order_time',
'minute_granularity'='1'
) AS
SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
 avg(price) FROM sales GROUP BY order_time, country, sex

c) Users do not have to create aggregate tables for every time granularity, because the system supports automatic roll-up. For example, once a Day-granularity aggregate table exists, a group by aggregation at Year or Month granularity is computed from the already aggregated Day-granularity values:

CREATE DATAMAP agg_day
ON TABLE sales
USING "timeseries"
DMPROPERTIES (
'event_time'='order_time',
'day_granularity'='1'
) AS
SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
 avg(price) FROM sales GROUP BY order_time, country, sex

(Aggregate queries at Year and Month granularity can roll up from the agg_day table created above.)

SELECT timeseries(order_time, 'month'), sum(quantity) FROM sales group by timeseries(order_time, 'month')
SELECT timeseries(order_time, 'year'), sum(quantity) FROM sales group by timeseries(order_time, 'year')
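
Roll-up works because sums, maxima, and counts stored at day granularity can be merged into coarser time buckets without going back to the detail rows. The following is a minimal sketch of that idea in plain Python, not CarbonData's own code. The daily values and the roll_up helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical day-granularity aggregates:
# day -> (sum_quantity, max_quantity, row_count)
daily = {
    date(2018, 1, 30): (10, 4, 3),
    date(2018, 1, 31): (6, 5, 2),
    date(2018, 2, 1): (7, 7, 1),
    date(2019, 2, 1): (9, 6, 2),
}


def roll_up(daily_aggs, key_of):
    """Merge day-level aggregates into the coarser bucket returned by key_of(day)."""
    out = defaultdict(lambda: [0, None, 0])
    for day, (total, biggest, count) in daily_aggs.items():
        bucket = out[key_of(day)]
        bucket[0] += total                      # sums add up
        bucket[1] = biggest if bucket[1] is None else max(bucket[1], biggest)
        bucket[2] += count                      # counts add up
    return {key: tuple(values) for key, values in out.items()}


monthly = roll_up(daily, lambda d: (d.year, d.month))
yearly = roll_up(daily, lambda d: d.year)

for bucket, values in sorted(yearly.items()):
    print(bucket, values)
```

An average at month or year granularity is then the merged sum divided by the merged count, which is why storing sum and count is enough to roll up avg as well.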

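4. Real-time ingestion with near-real-time query

In 1.3.0, data can be loaded into a CarbonData table in real time through Structured Streaming, and the fresh data can be queried immediately.

a) Read data in real time: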

 val readSocketDF = spark.readStream
   .format("socket")
   .option("host", "localhost")
   .option("port", 9099)
   .load()

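b) Write the data to a CarbonData table:

// ProcessingTime comes from org.apache.spark.sql.streaming;
// tablePath is the table path object obtained earlier in the example referenced below.
val qry = readSocketDF.writeStream
  .format("carbondata")
  .trigger(ProcessingTime("5 seconds"))
  .option("checkpointLocation", tablePath.getStreamingCheckpointDir)
  .option("dbName", "default")
  .option("tableName", "carbon_table")
  .start()

(For details, see the example /apache/carbondata/examples/CarbonStructuredStreamingExample.scala)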

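5. Standard partition support:

This partition feature works the same way as Hive and Spark partitions: users can partition by column value and, at query time, target specific partitions for fast lookups. Combined with SORT_COLUMNS, it builds multi-level ordering that serves filter queries on any combination of dimensions, so a single copy of the data serves multiple application scenarios. For example, the table below uses productDate as the partition column, so data is partitioned by day, and SORT_COLUMNS builds a multi-dimensional MDK index. Data can then be queried quickly with filters on any combination of productDate, productName, storeProvince, and storeCity.

CREATE TABLE IF NOT EXISTS productSalesTable (
productName STRING,
storeProvince STRING,
storeCity STRING,
saleQuantity INT,
revenue INT)
PARTITIONED BY (productDate DATE)
STORED BY 'carbondata'
TBLPROPERTIES('SORT_COLUMNS' = 'productName, storeProvince, storeCity')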
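
To see why combining a partition column with sorted columns narrows a scan, consider the simplified in-memory model below. It is only an illustration under stated assumptions and does not reproduce CarbonData's MDK index: the partitions dictionary, the sample rows, and the query helper are all hypothetical. A filter on productDate selects a single partition, and a filter on the leading sort column is resolved by binary search inside it.

```python
from bisect import bisect_left, bisect_right

# Hypothetical layout: one row list per productDate partition, each sorted by
# (productName, storeProvince, storeCity). Rows end with saleQuantity.
partitions = {
    "2018-02-01": sorted([
        ("phone", "Guangdong", "Shenzhen", 5),
        ("laptop", "Zhejiang", "Hangzhou", 2),
        ("phone", "Beijing", "Beijing", 3),
    ]),
    "2018-02-02": sorted([
        ("phone", "Guangdong", "Shenzhen", 7),
        ("tablet", "Guangdong", "Guangzhou", 1),
    ]),
}


def query(product_date, product_name):
    """Return rows for one date and one product name."""
    # Partition pruning: only the matching date's rows are touched at all.
    rows = partitions.get(product_date, [])
    # The leading sort column is ordered, so a binary search finds the range.
    # (A real engine would keep an index rather than rebuild this list.)
    names = [row[0] for row in rows]
    lo = bisect_left(names, product_name)
    hi = bisect_right(names, product_name)
    return rows[lo:hi]


for row in query("2018-02-01", "phone"):
    print(row)
```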

6. CREATE TABLE AS SELECT syntax

CREATE TABLE carbon_table STORED BY 'carbondata' AS SELECT * FROM parquet_table

7. Querying specific loaded data

Each batch of data loaded into CarbonData is placed in its own segment. In 1.3.0, users can query specific segments, that is, query selected data batches on demand.

a) List the segment IDs

SHOW SEGMENTS FOR TABLE <databasename>.<table_name>

b) Set the segment IDs

SET carbon.input.segments.<databasename>.<table_name> = <list of segment IDs>

(For details, see the example: /apache/carbondata/examples/QuerySegmentExample.scala)
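
The relationship between load batches, segment IDs, and a restricted scan can be pictured with a toy model. The sketch below is plain Python and makes no claim about CarbonData internals. The SegmentedTable class and its methods are hypothetical names that merely mirror the listing of segment IDs and the segment setting described above.

```python
class SegmentedTable:
    """Toy model: every load creates one new segment holding that batch."""

    def __init__(self):
        self.segments = {}          # segment id -> list of rows
        self.input_segments = None  # None means "read all segments"

    def load(self, rows):
        """Store one batch as a new segment and return its id."""
        seg_id = len(self.segments)
        self.segments[seg_id] = list(rows)
        return seg_id

    def show_segments(self):
        """List the ids of all segments, oldest first."""
        return sorted(self.segments)

    def scan(self):
        """Read only the selected segments, or all of them if none are selected."""
        ids = self.show_segments() if self.input_segments is None else self.input_segments
        return [row for seg_id in ids for row in self.segments[seg_id]]


table = SegmentedTable()
table.load([("a", 1), ("b", 2)])   # becomes segment 0
table.load([("c", 3)])             # becomes segment 1
table.load([("d", 4)])             # becomes segment 2

all_rows = table.scan()
table.input_segments = [1, 2]      # restrict later scans to two batches
recent_rows = table.scan()
print(recent_rows)
```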

Apache CarbonData website: apache.org

1.3.0 download link.
