Author: imba-Y_685 | Source: Internet | 2023-09-10 13:48
I have an ETL process that involves a stored procedure that makes heavy use of SELECT INTO statements (minimally logged and therefore faster, as they generate less log traffic). Of the batch of work that takes place in one particular stored procedure, several of the most expensive operations are eager spools that appear to just buffer the query results and then copy them into the table being created.
The MSDN documentation on eager spools is quite sparse. Does anyone have a deeper insight into whether these are really necessary (and under what circumstances)? I have a few theories that may or may not make sense, but no success in eliminating these from the queries.
The .sqlplan files are quite large (160kb) so I guess it's probably not reasonable to post them directly to a forum.
So, here are some theories that may be amenable to specific answers:
- The query uses some UDFs for data transformation, such as parsing formatted dates. Does this data transformation necessitate the use of eager spools to allocate sensible types (e.g. varchar lengths) to the table before it constructs it?
- As an extension of the question above, does anyone have a deeper view of what does or does not drive this operation in a query?
1 answer
26
My understanding of spooling is that it's a bit of a red herring on your execution plan. Yes, it accounts for a lot of your query cost, but it's actually an optimization that SQL Server undertakes automatically so that it can avoid costly rescanning. If you were to avoid spooling, the cost of the execution subtree it sits on would go up, and almost certainly the cost of the whole query would increase. I don't have any particular insight into what might cause the query optimizer to build the plan that way, especially without seeing the SQL code, but you're probably better off trusting its behavior.
However, that doesn't mean your execution plan can't be optimized, depending on exactly what you're up to and how volatile your source data is. When you're doing a SELECT INTO, you'll often see spooling items on your execution plan, and it can be related to read isolation. If it's appropriate for your particular situation, you might try just lowering the transaction isolation level to something less costly, and/or using the NOLOCK hint. I've found in complicated performance-critical queries that NOLOCK, if safe and appropriate for your data, can vastly increase the speed of query execution even when there doesn't seem to be any reason it should.
In this situation, if you try READ UNCOMMITTED or the NOLOCK hint, you may be able to eliminate some of the spools. (Obviously you don't want to do this if it's likely to land you in an inconsistent state, but everyone's data isolation requirements are different.) The TOP operator and the OR operator can occasionally cause spooling, but I doubt you're using either of those in an ETL process...
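As a rough sketch of the two options above, assuming a hypothetical source table dbo.StagingOrders_Raw and hypothetical target names (none of these come from the question), the session-level setting and the per-table hint might look like this. Whether either one actually removes a spool depends on the plan, so compare the execution plans before and after:

```sql
-- Option 1: lower the isolation level for the session, then restore it.
-- Table and column names here are hypothetical placeholders.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT s.OrderId, s.OrderDateText, s.Amount
INTO   dbo.StagingOrders_Copy1          -- target table is created by SELECT INTO
FROM   dbo.StagingOrders_Raw AS s;

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;  -- back to the SQL Server default

-- Option 2: apply the NOLOCK table hint to a single source table only.
SELECT s.OrderId, s.OrderDateText, s.Amount
INTO   dbo.StagingOrders_Copy2
FROM   dbo.StagingOrders_Raw AS s WITH (NOLOCK);
```

Both forms allow dirty reads, so they are only worth trying when nothing else is writing to the source tables during the ETL run.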
You're right in saying that your UDFs could also be the culprit. If you're only using each UDF once, it would be an interesting experiment to try putting them inline to see if you get a large performance benefit. (And if you can't figure out a way to write them inline with the query, that's probably why they might be causing spooling).
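To illustrate the inlining experiment, here is a sketch that assumes a hypothetical scalar UDF dbo.ParseOrderDate that converts a yyyymmdd string to a date; the function, tables, and columns are all made up for the example. The idea is to run both versions and compare the plans and timings:

```sql
-- Before: a scalar UDF (hypothetical) is called once per row.
SELECT s.OrderId,
       dbo.ParseOrderDate(s.OrderDateText) AS OrderDate
INTO   dbo.Orders_ViaUdf
FROM   dbo.StagingOrders_Raw AS s;

-- After: the same conversion written inline.
-- Style 112 is the ISO yyyymmdd format for CONVERT.
SELECT s.OrderId,
       CONVERT(date, s.OrderDateText, 112) AS OrderDate
INTO   dbo.Orders_Inline
FROM   dbo.StagingOrders_Raw AS s;
```

The inline form also makes the output column type explicit, which speaks to the question about the types allocated to the new table.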
One last thing I would look at is that, if you're doing any joins that can be re-ordered, try using a hint to force the join order to happen in what you know to be the most selective order. That's a bit of a reach but it doesn't hurt to try it if you're already stuck optimizing.
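One way to try this, assuming hypothetical tables dbo.ActiveCustomers (small, selective) and dbo.OrderFacts (large), is the FORCE ORDER query hint, which makes SQL Server join the tables in the order they are written in the FROM clause:

```sql
-- List the most selective table first; FORCE ORDER preserves the written join order.
-- All table and column names are hypothetical placeholders.
SELECT f.OrderId, f.Amount, c.CustomerId
INTO   dbo.ActiveCustomerOrders
FROM   dbo.ActiveCustomers AS c
JOIN   dbo.OrderFacts      AS f
       ON f.CustomerId = c.CustomerId
OPTION (FORCE ORDER);
```

If the hinted plan is not clearly better, drop the hint again, since it stops the optimizer from adapting the join order as the data changes.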