热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

ApacheDrill

2019独角兽企业重金招聘Python工程师标准WhyDrillTop10ReasonstoUseDrill1.GetstartedinminutesIttakesacoup

2019独角兽企业重金招聘Python工程师标准>>> hot3.png

Why Drill

Top 10 Reasons to Use Drill

1. Get started in minutes

It takes a couple of minutes to start working with Drill. Untar the Drill software on your Mac or Windows laptop and run a query on a local file. No need to set up any infrastructure or to define schemas. Just point to the data, such as data in a file, directory, HBase table, and drill.

$ tar -xvf apache-drill-.tar.gz
$ /bin/drill-embedded
0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json` LIMIT 5;
+--------------+----------------------------+---------------------+---------------+--------------+----------------------------+-----------+----------------+-------------+------------------------+----------+----------------+----------------------+-----------------+---------+-----------------------+
| employee_id | full_name | first_name | last_name | position_id | position_title | store_id | department_id | birth_date | hire_date | salary | supervisor_id | education_level | marital_status | gender | management_role |
+--------------+----------------------------+---------------------+---------------+--------------+----------------------------+-----------+----------------+-------------+------------------------+----------+----------------+----------------------+-----------------+---------+-----------------------+
| 1 | Sheri Nowmer | Sheri | Nowmer | 1 | President | 0 | 1 | 1961-08-26 | 1994-12-01 00:00:00.0 | 80000.0 | 0 | Graduate Degree | S | F | Senior Management |
| 2 | Derrick Whelply | Derrick | Whelply | 2 | VP Country Manager | 0 | 1 | 1915-07-03 | 1994-12-01 00:00:00.0 | 40000.0 | 1 | Graduate Degree | M | M | Senior Management |
| 4 | Michael Spence | Michael | Spence | 2 | VP Country Manager | 0 | 1 | 1969-06-20 | 1998-01-01 00:00:00.0 | 40000.0 | 1 | Graduate Degree | S | M | Senior Management |
| 5 | Maya Gutierrez | Maya | Gutierrez | 2 | VP Country Manager | 0 | 1 | 1951-05-10 | 1998-01-01 00:00:00.0 | 35000.0 | 1 | Bachelors Degree | M | F | Senior Management |

第一个原因无非是说drill安装简单,同时支持多了数据源如文件,hbase等其他数据源。


2. Schema-free JSON model

Drill is the world's first and only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. No need to define and maintain schemas or transform data (ETL). Drill automatically understands the structure of the data.

这个个人其实还是比较喜欢的一个功能,对于之前的hive,hbase等系统来说,在查询数据之前,我们必须要去定义table的元数据信息。而drill本身是不保存元数据信息的,元数据信息是保存在本身的底层存储中,这样使得drill其实更可以关注与自身的查询性能。

3. Query complex, semi-structured data in-situ

Using Drill's schema-free JSON model, you can query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution. Drill also provides intuitive extensions to SQL to work with nested data. Here's a simple query on a JSON file demonstrating how to access nested elements and arrays:

SELECT * FROM (SELECT t.trans_id,t.trans_info.prod_id[0] AS prod_id,t.trans_info.purch_flag AS purchasedFROM `clicks/clicks.json` t) sq
WHERE sq.prod_id BETWEEN 700 AND 750 ANDsq.purchased = 'true'
ORDER BY sq.prod_id;

支持复杂的sql查询

4. Real SQL -- not "SQL-like"

Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" language or struggle with a semi-functional BI tool. Drill supports many data types including DATE, INTERVALDAY/INTERVALYEAR, TIMESTAMP, and VARCHAR, as well as complex query constructs such as correlated sub-queries and joins in WHERE clauses. Here is an example of a TPC-H standard query that runs in Drill "as is":

TPC-H query 4

SELECT o.o_orderpriority, count(*) AS order_count
FROM orders o
WHERE o.o_orderdate >= date '1996-10-01'AND o.o_orderdate

之前的hive或者是建立在hbase上的phoenix其实本质都是sql-like,无法兼容sql92,使得很多sql其实是需要一些改造的。drill就是完全支持sql。

5. Leverage standard BI tools

Drill works with standard BI tools. You can use your existing tools, such as Tableau, MicroStrategy, QlikView and Excel.

其实由于支持jdbc,odcb,同时具备低延迟的特性使得drill完全可以对接BI工具。

6. Interactive queries on Hive tables

Apache Drill lets you leverage your investments in Hive. You can run interactive queries with Drill on your Hive tables and access all Hive input/output formats (including custom SerDes). You can join tables associated with different Hive metastores, and you can join a Hive table with an HBase table or a directory of log files. Here's a simple query in Drill on a Hive table:

SELECT `month`, state, sum(order_total) AS sales
FROM hive.orders
GROUP BY `month`, state
ORDER BY 3 DESC LIMIT 5;

支持hive不解释

7. Access multiple data sources

Drill is extensible. You can connect Drill out-of-the-box to file systems (local or distributed, such as S3, HDFS and MapR-FS), HBase and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions. Here's a query that combines data from a Hive table, an HBase table (view) and a JSON file:

SELECT custview.membership, sum(orders.order_total) AS sales
FROM hive.orders, custview, dfs.`clicks/clicks.json` c
WHERE orders.cust_id = custview.cust_id AND orders.cust_id = c.user_info.cust_id
GROUP BY custview.membership
ORDER BY 2;

这个其实还是略屌的功能,使得管理员无需在对数据进行集中化的存储。drill支持从多数据源进行数据的join查询。

8. User-Defined Functions (UDFs)

Drill exposes a simple and high-performance Java API to build custom functions (UDFs and UDAFs) so that you can add your own business logic. If you have already built UDFs in Hive, you can reuse them with Drill with no modifications. Refer to [Developing Custom Functions]((/docs/develop-custom-functions/) for more information.

UDF功能,不解释,用过hive应该都知道的。其实udf功能还是必须的,因为很多时候对数据的复杂处理是在udf中的。

9. High performance

Drill is designed from the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down, which is the capability to push down query fragments into the back-end data sources. Drill also provides a columnar and vectorized execution engine, resulting in higher memory and CPU efficiency.

高性能,其实drill的定位应该是类似OLAP的查询或者是即席查询,低延迟。当然我们也无法给出究竟多少秒算低延迟。其实我们可以看见低延迟的关键几点技术:本地化,操作下发。大数据集群就是要避开移动数据,尽量本地化,减少网络IO,而操作下发,其实可以理解为整个集群是一个金字塔,任务一层一层向下处理,结果在向上一层一层回送聚合。这个有点google的bigquery的感觉。

10. Scales from a single laptop to a 1000-node cluster

Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.

这个应该是支持的,hadoop据说都可以到5000,不过觉得gc的不断优化,应该可以突破。

========================================================华丽分割线=====================================================

传送门:http://drill.apache.org/docs     有兴趣的可以看看google的dreml的论文


转载于:https://my.oschina.net/hadooper/blog/419233


推荐阅读
  • 20100423:Fixes:更新批处理,以兼容WIN7。第一次系统地玩QT,于是诞生了此预备式:【QT版本4.6.0&#x ... [详细]
  • 使用Pandas高效读取SQL脚本中的数据
    本文详细介绍了如何利用Pandas直接读取和解析SQL脚本,提供了一种高效的数据处理方法。该方法适用于各种数据库导出的SQL脚本,并且能够显著提升数据导入的速度和效率。 ... [详细]
  • 本题旨在通过给定的评级信息,利用拓扑排序和并查集算法来确定全球 Tetris 高手排行榜。题目要求判断是否可以根据提供的信息生成一个明确的排名表,或者是否存在冲突或信息不足的情况。 ... [详细]
  • 本文详细介绍了如何准备和安装 Eclipse 开发环境及其相关插件,包括 JDK、Tomcat、Struts 等组件的安装步骤及配置方法。 ... [详细]
  • 通过Web界面管理Linux日志的解决方案
    本指南介绍了一种利用rsyslog、MariaDB和LogAnalyzer搭建集中式日志管理平台的方法,使用户可以通过Web界面查看和分析Linux系统的日志记录。此方案不仅适用于服务器环境,还提供了详细的步骤来确保系统的稳定性和安全性。 ... [详细]
  • 本文详细探讨了 Django 的 ORM(对象关系映射)机制,重点介绍了其如何通过 Python 元类技术实现数据库表与 Python 类的映射。此外,文章还分析了 Django 中各种字段类型的继承结构及其与数据库数据类型的对应关系。 ... [详细]
  • 数据结构入门:栈的基本概念与操作
    本文详细介绍了栈这一重要的数据结构,包括其基本概念、顺序存储结构、栈的基本操作(如入栈、出栈、清空栈和销毁栈),以及如何利用栈实现二进制到十进制的转换。通过具体代码示例,帮助读者更好地理解和应用栈的相关知识。 ... [详细]
  • 本文介绍如何在SQL Server中对Name列进行排序,使特定值(如Default Deliverable Submission Notification)显示在结果集的顶部。 ... [详细]
  • 本文详细介绍了网络存储技术的基本概念、分类及应用场景。通过分析直连式存储(DAS)、网络附加存储(NAS)和存储区域网络(SAN)的特点,帮助读者理解不同存储方式的优势与局限性。 ... [详细]
  • 本文介绍了Linux系统中的文件IO操作,包括文件描述符、基本文件操作函数以及目录操作。详细解释了各个函数的参数和返回值,并提供了代码示例。 ... [详细]
  • 深入解析Redis内存对象模型
    本文详细介绍了Redis内存对象模型的关键知识点,包括内存统计、内存分配、数据存储细节及优化策略。通过实际案例和专业分析,帮助读者全面理解Redis内存管理机制。 ... [详细]
  • Python第三方库安装的多种途径及注意事项
    本文详细介绍了Python第三方库的几种常见安装方法,包括使用pip命令、集成开发环境(如Anaconda)以及手动文件安装,并提供了每种方法的具体操作步骤和适用场景。 ... [详细]
  • This pull request introduces the ability to provide comprehensive paragraph configurations directly within the Create Note and Create Paragraph REST endpoints, reducing the need for additional configuration calls. ... [详细]
  • 磁盘健康检查与维护
    在计算机系统运行过程中,硬件或电源故障可能会导致文件系统出现异常。为确保数据完整性和系统稳定性,定期进行磁盘健康检查至关重要。本文将详细介绍如何使用fsck和badblocks工具来检测和修复文件系统及硬盘扇区的潜在问题。 ... [详细]
  • 在成功安装和测试MySQL及Apache之后,接下来的步骤是安装PHP。为了确保安全性和配置的一致性,建议在安装PHP前先停止MySQL和Apache服务,并将MySQL集成到PHP中。 ... [详细]
author-avatar
笃志单车小博_801
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有