Why Drill
Top 10 Reasons to Use Drill
1. Get started in minutes
It takes a couple of minutes to start working with Drill. Untar the Drill software on your Mac or Windows laptop and run a query on a local file. No need to set up any infrastructure or to define schemas. Just point to the data, such as data in a file, directory, HBase table, and drill.
$ tar -xvf apache-drill-
$
0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json` LIMIT 5;
+-------------+-----------------+------------+-----------+-------------+--------------------+----------+---------------+------------+-----------------------+---------+---------------+------------------+----------------+--------+-------------------+
| employee_id | full_name       | first_name | last_name | position_id | position_title     | store_id | department_id | birth_date | hire_date             | salary  | supervisor_id | education_level  | marital_status | gender | management_role   |
+-------------+-----------------+------------+-----------+-------------+--------------------+----------+---------------+------------+-----------------------+---------+---------------+------------------+----------------+--------+-------------------+
| 1           | Sheri Nowmer    | Sheri      | Nowmer    | 1           | President          | 0        | 1             | 1961-08-26 | 1994-12-01 00:00:00.0 | 80000.0 | 0             | Graduate Degree  | S              | F      | Senior Management |
| 2           | Derrick Whelply | Derrick    | Whelply   | 2           | VP Country Manager | 0        | 1             | 1915-07-03 | 1994-12-01 00:00:00.0 | 40000.0 | 1             | Graduate Degree  | M              | M      | Senior Management |
| 4           | Michael Spence  | Michael    | Spence    | 2           | VP Country Manager | 0        | 1             | 1969-06-20 | 1998-01-01 00:00:00.0 | 40000.0 | 1             | Graduate Degree  | S              | M      | Senior Management |
| 5           | Maya Gutierrez  | Maya       | Gutierrez | 2           | VP Country Manager | 0        | 1             | 1951-05-10 | 1998-01-01 00:00:00.0 | 35000.0 | 1             | Bachelors Degree | M              | F      | Senior Management |
+-------------+-----------------+------------+-----------+-------------+--------------------+----------+---------------+------------+-----------------------+---------+---------------+------------------+----------------+--------+-------------------+
The first reason simply says that Drill is easy to install and supports many data sources, such as files, HBase, and others.
2. Schema-free JSON model
Drill is the world's first and only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. No need to define and maintain schemas or transform data (ETL). Drill automatically understands the structure of the data.
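Personally, this is a feature I like a lot. With earlier systems such as Hive and HBase, we had to define a table's metadata before we could query the data. Drill itself does not store metadata; the metadata lives in the underlying storage, which lets Drill concentrate on its own query performance.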
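To make the schema-free idea more concrete, here is a minimal Python sketch of schema-on-read. It is not how Drill is implemented; it only illustrates that the column set can be discovered from the records at query time instead of being declared beforehand. The sample records and the helper names `discover_columns` and `select` are made up for this illustration.

```python
import json

# Hypothetical input: one JSON document per line, with no declared schema.
raw_lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "tags": ["x", "y"]}',
    '{"id": 3, "name": "c", "address": {"city": "Oslo"}}',
]

def discover_columns(lines):
    """Collect every top-level field and the value types seen for it."""
    columns = {}
    for line in lines:
        record = json.loads(line)
        for key, value in record.items():
            columns.setdefault(key, set()).add(type(value).__name__)
    return columns

def select(lines, wanted):
    """Project the wanted fields; a field missing from a record becomes None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({column: record.get(column) for column in wanted})
    return rows

columns = discover_columns(raw_lines)
rows = select(raw_lines, ["id", "name"])
print(columns)
print(rows)
```

The second record has no `name` field, so the projection returns `None` for it, loosely mirroring how a schema-free engine has to cope with records whose shapes differ.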
3. Query complex, semi-structured data in-situ
Using Drill's schema-free JSON model, you can query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution. Drill also provides intuitive extensions to SQL to work with nested data. Here's a simple query on a JSON file demonstrating how to access nested elements and arrays:
SELECT * FROM (SELECT t.trans_id,
                      t.trans_info.prod_id[0] AS prod_id,
                      t.trans_info.purch_flag AS purchased
               FROM `clicks/clicks.json` t) sq
WHERE sq.prod_id BETWEEN 700 AND 750 AND
      sq.purchased = 'true'
ORDER BY sq.prod_id;
Complex SQL queries are supported.
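For readers who want to see what the query above computes, the same logic can be sketched in plain Python over in-memory records. The sample records are invented for this illustration (the real `clicks.json` is not shown in this post), and the sketch assumes every `prod_id` array is non-empty.

```python
# Hypothetical records shaped like the rows the query above expects.
clicks = [
    {"trans_id": 1, "trans_info": {"prod_id": [710, 16], "purch_flag": "true"}},
    {"trans_id": 2, "trans_info": {"prod_id": [805], "purch_flag": "true"}},
    {"trans_id": 3, "trans_info": {"prod_id": [702, 44], "purch_flag": "false"}},
    {"trans_id": 4, "trans_info": {"prod_id": [700], "purch_flag": "true"}},
]

def first_product_purchases(records, low=700, high=750):
    # Inner SELECT: project nested fields, taking element 0 of the array.
    projected = [
        {
            "trans_id": r["trans_id"],
            "prod_id": r["trans_info"]["prod_id"][0],
            "purchased": r["trans_info"]["purch_flag"],
        }
        for r in records
    ]
    # WHERE: BETWEEN is inclusive on both ends; the flag is compared as a string.
    kept = [
        p for p in projected
        if low <= p["prod_id"] <= high and p["purchased"] == "true"
    ]
    # ORDER BY sq.prod_id
    return sorted(kept, key=lambda p: p["prod_id"])

result = first_product_purchases(clicks)
print(result)
```

With this sample data only transactions 4 and 1 survive the filter, in that order, since their first product IDs are 700 and 710.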
4. Real SQL -- not "SQL-like"
Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" language or struggle with a semi-functional BI tool. Drill supports many data types including DATE, INTERVALDAY/INTERVALYEAR, TIMESTAMP, and VARCHAR, as well as complex query constructs such as correlated sub-queries and joins in WHERE clauses. Here is an example of a TPC-H standard query that runs in Drill "as is":
TPC-H query 4
SELECT o.o_orderpriority, count(*) AS order_count
FROM orders o
WHERE o.o_orderdate >= date '1996-10-01' AND o.o_orderdate
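Earlier systems such as Hive, or Phoenix built on top of HBase, are essentially SQL-like and not compatible with SQL-92, so many SQL statements need some rewriting before they run. Drill supports SQL in full.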
5. Leverage standard BI tools
Drill works with standard BI tools. You can use your existing tools, such as Tableau, MicroStrategy, QlikView and Excel.
Because Drill supports JDBC and ODBC and also offers low latency, it can connect directly to BI tools.
6. Interactive queries on Hive tables
Apache Drill lets you leverage your investments in Hive. You can run interactive queries with Drill on your Hive tables and access all Hive input/output formats (including custom SerDes). You can join tables associated with different Hive metastores, and you can join a Hive table with an HBase table or a directory of log files. Here's a simple query in Drill on a Hive table:
SELECT `month`, state, sum(order_total) AS sales
FROM hive.orders
GROUP BY `month`, state
ORDER BY 3 DESC LIMIT 5;
Hive is supported; no explanation needed.
7. Access multiple data sources
Drill is extensible. You can connect Drill out-of-the-box to file systems (local or distributed, such as S3, HDFS and MapR-FS), HBase and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions. Here's a query that combines data from a Hive table, an HBase table (view) and a JSON file:
SELECT custview.membership, sum(orders.order_total) AS sales
FROM hive.orders, custview, dfs.`clicks/clicks.json` c
WHERE orders.cust_id = custview.cust_id AND orders.cust_id = c.user_info.cust_id
GROUP BY custview.membership
ORDER BY 2;
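This is actually a pretty cool feature: administrators no longer need to centralize where the data is stored. Drill supports join queries across multiple data sources.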
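As a rough picture of what the three-way join above produces, here is a small Python sketch that joins three in-memory "sources" on `cust_id` and sums `order_total` per membership. The data, the variable names, and the assumption that `cust_id` is unique in the customer view are all invented for illustration; Drill's actual join execution is of course more sophisticated.

```python
from collections import defaultdict

# Hypothetical stand-ins for the Hive table, the HBase view and the JSON file.
hive_orders = [
    {"cust_id": 1, "order_total": 50},
    {"cust_id": 2, "order_total": 20},
    {"cust_id": 1, "order_total": 30},
    {"cust_id": 3, "order_total": 99},
]
hbase_custview = [
    {"cust_id": 1, "membership": "gold"},
    {"cust_id": 2, "membership": "basic"},
    {"cust_id": 3, "membership": "gold"},
]
json_clicks = [
    {"user_info": {"cust_id": 1}},
    {"user_info": {"cust_id": 2}},
    {"user_info": {"cust_id": 2}},
]

def sales_by_membership(orders, custview, clicks):
    # Build lookup tables for the two smaller sides of the join.
    # Assumption: cust_id is unique in custview.
    membership_of = {c["cust_id"]: c["membership"] for c in custview}
    click_count = defaultdict(int)
    for c in clicks:
        click_count[c["user_info"]["cust_id"]] += 1

    sales = defaultdict(int)
    for o in orders:
        cid = o["cust_id"]
        matches = click_count.get(cid, 0)
        if cid in membership_of and matches > 0:
            # An inner join emits one row per matching click, so the order
            # total is counted once for each click of that customer.
            sales[membership_of[cid]] += o["order_total"] * matches
    # ORDER BY 2: sort by the aggregated sales column, ascending.
    return sorted(sales.items(), key=lambda kv: kv[1])

report = sales_by_membership(hive_orders, hbase_custview, json_clicks)
print(report)
```

Customer 3 has no click records and therefore drops out of the inner join, while customer 2's order is counted twice because two click rows match it.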
8. User-Defined Functions (UDFs)
Drill exposes a simple and high-performance Java API to build custom functions (UDFs and UDAFs) so that you can add your own business logic. If you have already built UDFs in Hive, you can reuse them with Drill with no modifications. Refer to [Developing Custom Functions](/docs/develop-custom-functions/) for more information.
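UDF support needs no explanation; anyone who has used Hive will know it. UDFs are still essential, because complex data processing is often done inside a UDF.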
9. High performance
Drill is designed from the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down, which is the capability to push down query fragments into the back-end data sources. Drill also provides a columnar and vectorized execution engine, resulting in higher memory and CPU efficiency.
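High performance. Drill is positioned for OLAP-style or ad hoc queries with low latency, although we cannot say exactly how many seconds count as low latency. We can see the key techniques behind the low latency: data locality and operator push-down. A big data cluster should avoid moving data, keep processing as local as possible, and reduce network I/O. Operator push-down can be pictured as treating the whole cluster as a pyramid: tasks are handed down layer by layer, and results are sent back up and aggregated layer by layer. This feels a bit like Google's BigQuery.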
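The benefit of push-down can be shown with a toy Python example. This is only a sketch under simplified assumptions, not Drill's optimizer: the `Source` class, its `rows_sent` counter, and the sample rows are hypothetical. The point is that the same result is produced either way, but far fewer rows leave the data source when the filter runs inside it.

```python
class Source:
    """Toy data source that counts how many rows it hands to the engine."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_sent = 0

    def scan(self, predicate=None):
        # With a predicate, filtering happens inside the source (push-down).
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_sent += 1
                yield row

def is_ca(row):
    return row["state"] == "CA"

# 100 hypothetical rows, 10 of which are in CA.
rows = [{"id": i, "state": "CA" if i % 10 == 0 else "NY"} for i in range(100)]

# Without push-down: every row is transferred, then filtered in the engine.
no_pushdown = Source(rows)
result_a = [r for r in no_pushdown.scan() if is_ca(r)]

# With push-down: the source applies the filter and ships only matching rows.
with_pushdown = Source(rows)
result_b = list(with_pushdown.scan(is_ca))

print(no_pushdown.rows_sent, with_pushdown.rows_sent)
```

Both runs return the same 10 rows; the difference is only in how much data crosses the boundary between storage and the query engine, which is where the network I/O savings come from.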
10. Scales from a single laptop to a 1000-node cluster
Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.
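This should be supported; Hadoop can reportedly scale to 5000 nodes. I think that as GC keeps being optimized, Drill should be able to go beyond the 1000-node figure.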
================================================================================
Link: http://drill.apache.org/docs. If you are interested, take a look at Google's Dremel paper.