热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

ApacheDrill

2019独角兽企业重金招聘Python工程师标准WhyDrillTop10ReasonstoUseDrill1.GetstartedinminutesIttakesacoup

2019独角兽企业重金招聘Python工程师标准>>> hot3.png

Why Drill

Top 10 Reasons to Use Drill

1. Get started in minutes

It takes a couple of minutes to start working with Drill. Untar the Drill software on your Mac or Windows laptop and run a query on a local file. No need to set up any infrastructure or to define schemas. Just point to the data, such as data in a file, directory, HBase table, and drill.

$ tar -xvf apache-drill-.tar.gz
$ /bin/drill-embedded
0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json` LIMIT 5;
+--------------+----------------------------+---------------------+---------------+--------------+----------------------------+-----------+----------------+-------------+------------------------+----------+----------------+----------------------+-----------------+---------+-----------------------+
| employee_id | full_name | first_name | last_name | position_id | position_title | store_id | department_id | birth_date | hire_date | salary | supervisor_id | education_level | marital_status | gender | management_role |
+--------------+----------------------------+---------------------+---------------+--------------+----------------------------+-----------+----------------+-------------+------------------------+----------+----------------+----------------------+-----------------+---------+-----------------------+
| 1 | Sheri Nowmer | Sheri | Nowmer | 1 | President | 0 | 1 | 1961-08-26 | 1994-12-01 00:00:00.0 | 80000.0 | 0 | Graduate Degree | S | F | Senior Management |
| 2 | Derrick Whelply | Derrick | Whelply | 2 | VP Country Manager | 0 | 1 | 1915-07-03 | 1994-12-01 00:00:00.0 | 40000.0 | 1 | Graduate Degree | M | M | Senior Management |
| 4 | Michael Spence | Michael | Spence | 2 | VP Country Manager | 0 | 1 | 1969-06-20 | 1998-01-01 00:00:00.0 | 40000.0 | 1 | Graduate Degree | S | M | Senior Management |
| 5 | Maya Gutierrez | Maya | Gutierrez | 2 | VP Country Manager | 0 | 1 | 1951-05-10 | 1998-01-01 00:00:00.0 | 35000.0 | 1 | Bachelors Degree | M | F | Senior Management |

第一个原因无非是说drill安装简单,同时支持多了数据源如文件,hbase等其他数据源。


2. Schema-free JSON model

Drill is the world's first and only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. No need to define and maintain schemas or transform data (ETL). Drill automatically understands the structure of the data.

这个个人其实还是比较喜欢的一个功能,对于之前的hive,hbase等系统来说,在查询数据之前,我们必须要去定义table的元数据信息。而drill本身是不保存元数据信息的,元数据信息是保存在本身的底层存储中,这样使得drill其实更可以关注与自身的查询性能。

3. Query complex, semi-structured data in-situ

Using Drill's schema-free JSON model, you can query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution. Drill also provides intuitive extensions to SQL to work with nested data. Here's a simple query on a JSON file demonstrating how to access nested elements and arrays:

SELECT * FROM (SELECT t.trans_id,t.trans_info.prod_id[0] AS prod_id,t.trans_info.purch_flag AS purchasedFROM `clicks/clicks.json` t) sq
WHERE sq.prod_id BETWEEN 700 AND 750 ANDsq.purchased = 'true'
ORDER BY sq.prod_id;

支持复杂的sql查询

4. Real SQL -- not "SQL-like"

Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" language or struggle with a semi-functional BI tool. Drill supports many data types including DATE, INTERVALDAY/INTERVALYEAR, TIMESTAMP, and VARCHAR, as well as complex query constructs such as correlated sub-queries and joins in WHERE clauses. Here is an example of a TPC-H standard query that runs in Drill "as is":

TPC-H query 4

SELECT o.o_orderpriority, count(*) AS order_count
FROM orders o
WHERE o.o_orderdate >= date '1996-10-01'AND o.o_orderdate

之前的hive或者是建立在hbase上的phoenix其实本质都是sql-like,无法兼容sql92,使得很多sql其实是需要一些改造的。drill就是完全支持sql。

5. Leverage standard BI tools

Drill works with standard BI tools. You can use your existing tools, such as Tableau, MicroStrategy, QlikView and Excel.

其实由于支持jdbc,odcb,同时具备低延迟的特性使得drill完全可以对接BI工具。

6. Interactive queries on Hive tables

Apache Drill lets you leverage your investments in Hive. You can run interactive queries with Drill on your Hive tables and access all Hive input/output formats (including custom SerDes). You can join tables associated with different Hive metastores, and you can join a Hive table with an HBase table or a directory of log files. Here's a simple query in Drill on a Hive table:

SELECT `month`, state, sum(order_total) AS sales
FROM hive.orders
GROUP BY `month`, state
ORDER BY 3 DESC LIMIT 5;

支持hive不解释

7. Access multiple data sources

Drill is extensible. You can connect Drill out-of-the-box to file systems (local or distributed, such as S3, HDFS and MapR-FS), HBase and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions. Here's a query that combines data from a Hive table, an HBase table (view) and a JSON file:

SELECT custview.membership, sum(orders.order_total) AS sales
FROM hive.orders, custview, dfs.`clicks/clicks.json` c
WHERE orders.cust_id = custview.cust_id AND orders.cust_id = c.user_info.cust_id
GROUP BY custview.membership
ORDER BY 2;

这个其实还是略屌的功能,使得管理员无需在对数据进行集中化的存储。drill支持从多数据源进行数据的join查询。

8. User-Defined Functions (UDFs)

Drill exposes a simple and high-performance Java API to build custom functions (UDFs and UDAFs) so that you can add your own business logic. If you have already built UDFs in Hive, you can reuse them with Drill with no modifications. Refer to [Developing Custom Functions]((/docs/develop-custom-functions/) for more information.

UDF功能,不解释,用过hive应该都知道的。其实udf功能还是必须的,因为很多时候对数据的复杂处理是在udf中的。

9. High performance

Drill is designed from the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down, which is the capability to push down query fragments into the back-end data sources. Drill also provides a columnar and vectorized execution engine, resulting in higher memory and CPU efficiency.

高性能,其实drill的定位应该是类似OLAP的查询或者是即席查询,低延迟。当然我们也无法给出究竟多少秒算低延迟。其实我们可以看见低延迟的关键几点技术:本地化,操作下发。大数据集群就是要避开移动数据,尽量本地化,减少网络IO,而操作下发,其实可以理解为整个集群是一个金字塔,任务一层一层向下处理,结果在向上一层一层回送聚合。这个有点google的bigquery的感觉。

10. Scales from a single laptop to a 1000-node cluster

Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.

这个应该是支持的,hadoop据说都可以到5000,不过觉得gc的不断优化,应该可以突破。

========================================================华丽分割线=====================================================

传送门:http://drill.apache.org/docs     有兴趣的可以看看google的dreml的论文


转载于:https://my.oschina.net/hadooper/blog/419233


推荐阅读
  • 本文介绍如何使用布局文件在Android应用中排列多行TextView和Button,使其占据屏幕的特定比例,并提供示例代码以帮助理解和实现。 ... [详细]
  • 实体映射最强工具类:MapStruct真香 ... [详细]
  • 深入解析Spring启动过程
    本文详细介绍了Spring框架的启动流程,帮助开发者理解其内部机制。通过具体示例和代码片段,解释了Bean定义、工厂类、读取器以及条件评估等关键概念,使读者能够更全面地掌握Spring的初始化过程。 ... [详细]
  • 本文将介绍如何编写一些有趣的VBScript脚本,这些脚本可以在朋友之间进行无害的恶作剧。通过简单的代码示例,帮助您了解VBScript的基本语法和功能。 ... [详细]
  • Explore how Matterverse is redefining the metaverse experience, creating immersive and meaningful virtual environments that foster genuine connections and economic opportunities. ... [详细]
  • UNP 第9章:主机名与地址转换
    本章探讨了用于在主机名和数值地址之间进行转换的函数,如gethostbyname和gethostbyaddr。此外,还介绍了getservbyname和getservbyport函数,用于在服务器名和端口号之间进行转换。 ... [详细]
  • RecyclerView初步学习(一)
    RecyclerView初步学习(一)ReCyclerView提供了一种插件式的编程模式,除了提供ViewHolder缓存模式,还可以自定义动画,分割符,布局样式,相比于传统的ListVi ... [详细]
  • 深入探讨CPU虚拟化与KVM内存管理
    本文详细介绍了现代服务器架构中的CPU虚拟化技术,包括SMP、NUMA和MPP三种多处理器结构,并深入探讨了KVM的内存虚拟化机制。通过对比不同架构的特点和应用场景,帮助读者理解如何选择最适合的架构以优化性能。 ... [详细]
  • 本文介绍如何使用 Android 的 Canvas 和 View 组件创建一个简单的绘图板应用程序,支持触摸绘画和保存图片功能。 ... [详细]
  • 优化SQL Server批量数据插入存储过程的实现
    本文介绍了一种改进的SQL Server存储过程,用于生成批量插入语句。该方法不仅提高了性能,还支持单行和多行模式,适用于SQL Server 2005及以上版本。 ... [详细]
  • 烤鸭|本文_Spring之Bean的生命周期详解
    烤鸭|本文_Spring之Bean的生命周期详解 ... [详细]
  • 本文介绍了在Android项目中实现时间轴效果的方法,通过自定义ListView的Item布局和适配器逻辑,实现了动态显示和隐藏时间标签的功能。文中详细描述了布局文件、适配器代码以及时间格式化工具类的具体实现。 ... [详细]
  • Explore a common issue encountered when implementing an OAuth 1.0a API, specifically the inability to encode null objects and how to resolve it. ... [详细]
  • 本文介绍了如何使用 Spring Boot DevTools 实现应用程序在开发过程中自动重启。这一特性显著提高了开发效率,特别是在集成开发环境(IDE)中工作时,能够提供快速的反馈循环。默认情况下,DevTools 会监控类路径上的文件变化,并根据需要触发应用重启。 ... [详细]
  • 本文探讨了SSDP(简单服务发现协议)和WSD(Web服务发现)协议,特别是SSDP如何通过固定多播地址239.255.255.250:1900实现局域网内的服务自发现功能。文中还详细介绍了SSDP协议的关键操作类型及其应用场景。 ... [详细]
author-avatar
笃志单车小博_801
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有