Source: http://my.oschina.net/zengjie/blog/203876
SolrCloud and Replication
Replication ensures redundancy for your data, and enables you to send an update request to any node in the shard. If that node is a replica, it will forward the request to the leader, which then forwards it to all existing replicas, using versioning to make sure every replica has the most up-to-date version. This architecture enables you to be certain that your data can be recovered in the event of a disaster, even if you are using Near Real Time searching.
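To make the update flow concrete, here is a minimal SolrJ sketch of sending a document to the cluster and letting SolrCloud do the routing. The ZooKeeper address, collection name, and field names are assumptions, and the builder shown is the SolrJ 7/8 style (other SolrJ versions spell it differently).

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexToAnyNode {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble and collection name.
        // CloudSolrClient is ZooKeeper-aware: the update can be handed to any
        // node, is routed to the shard leader, and the leader forwards it to
        // every replica with a version number attached.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1.example.com:2181"), Optional.empty()).build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", "hello solrcloud");

            client.add("mycollection", doc);  // leader -> replicas happens under the hood
            client.commit("mycollection");    // hard commit; soft commits are covered below
        }
    }
}

Because every replica applies the same versioned update, any node can later serve a consistent view of the data.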
Near Real Time Searching
If you want to use the Near Real Time (NRT) search support, enable automatic soft commits in your solrconfig.xml file before storing it in ZooKeeper. Otherwise, you can send explicit soft commits to the cluster as you need them.
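As a hedged illustration, the sketch below sends an explicit soft commit from SolrJ; the ZooKeeper address and collection name are assumptions, and the comment notes the solrconfig.xml alternative.

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ExplicitSoftCommit {
    public static void main(String[] args) throws Exception {
        // The solrconfig.xml route is an <autoSoftCommit> block (with a
        // <maxTime> value) inside <updateHandler>; this sketch takes the
        // other route and issues the soft commit explicitly from the client.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1.example.com:2181"), Optional.empty()).build()) {
            // commit(collection, waitFlush, waitSearcher, softCommit):
            // softCommit=true makes new documents visible to searchers
            // without flushing index segments to stable storage.
            client.commit("mycollection", false, false, true);
        }
    }
}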
SolrCloud doesn't work very well with separated data clusters connected by an expensive pipe. The root problem is that SolrCloud's architecture sends documents to all the nodes in the cluster (on a per-shard basis), and that architecture is really dictated by the NRT functionality.
Imagine that you have a set of servers in China and another set in the US that are aware of each other. Assuming 5 replicas, a single update to a shard may make multiple trips over the expensive pipe before it's all done, probably slowing indexing speed unacceptably.
So the SolrCloud recommendation for this situation is to maintain these clusters separately; nodes in China don't even know that nodes exist in the US and vice-versa. When indexing, you send the update request to one node in the US and one in China and all the node-routing after that is local to the separate clusters. Requests can go to any node in either country and maintain a consistent view of the data.
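A minimal sketch of that indexing pattern, assuming two hypothetical ZooKeeper ensembles (zk-us / zk-cn) and a collection named mycollection: the indexing program writes every document to both independent clusters, and each cluster routes it locally from there.

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualClusterIndexer implements AutoCloseable {
    // Each cluster has its own ZooKeeper ensemble; neither knows about the other.
    private final CloudSolrClient usClient = new CloudSolrClient.Builder(
            Collections.singletonList("zk-us.example.com:2181"), Optional.empty()).build();
    private final CloudSolrClient cnClient = new CloudSolrClient.Builder(
            Collections.singletonList("zk-cn.example.com:2181"), Optional.empty()).build();

    public void index(SolrInputDocument doc) throws Exception {
        // Deliver the same update to both clusters; all node routing beyond
        // this point stays local to each cluster.
        usClient.add("mycollection", doc);
        cnClient.add("mycollection", doc);
    }

    @Override
    public void close() throws Exception {
        usClient.close();
        cnClient.close();
    }
}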
However, if your US cluster goes down, you have to re-synchronize the down cluster with up-to-date information from China. The process requires you to replicate the index from China to the repaired US installation and then get everything back up and working.
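One way to do that copy is the replication handler's fetchindex command, which tells a core to pull a full index from another core. The sketch below is only an illustration under assumptions: the host and core names are hypothetical, and the source-URL parameter is spelled masterUrl in older Solr releases (newer releases call it leaderUrl).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PullIndexFromHealthyCluster {
    public static void main(String[] args) throws Exception {
        // Hypothetical hosts and core names: the repaired US core is asked to
        // pull a full index copy from a healthy core in the China cluster.
        String source = "http://solr-cn1.example.com:8983/solr/"
                + "mycollection_shard1_replica1/replication";
        String target = "http://solr-us1.example.com:8983/solr/"
                + "mycollection_shard1_replica1/replication"
                + "?command=fetchindex&masterUrl=" + source;

        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(target)).GET().build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // the handler reports whether the fetch started
    }
}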
Disaster Recovery for an NRT system
Use of Near Real Time (NRT) searching affects the way that systems using SolrCloud behave during disaster recovery.
The procedure outlined below assumes that you are maintaining separate clusters, as described above. Consider, for example, an event in which the US cluster goes down (say, because of a hurricane), but the China cluster is intact. Disaster recovery consists of creating the new system and letting the intact cluster create a replica for each shard on it, then promoting those replicas to be leaders of the newly created US cluster.
Here are the steps to take:
- Take the downed system offline to all end users.
- Take the indexing process offline.
- Repair the system.
- Bring up one machine per shard in the repaired system as part of the ZooKeeper cluster on the good system, and wait for replication to happen, creating a replica on that machine. (Soft commits will not be repeated, but data will be pulled from the transaction logs if necessary.) Note that SolrCloud will automatically use old-style replication for the bulk load; by temporarily having only one replica, you'll minimize data transfer across a slow connection. A rough catch-up check is sketched after this list.
- Bring the machines of the repaired cluster down, and reconfigure them to be a separate Zookeeper cluster again, optionally adding more replicas for each shard.
- Make the repaired system visible to end users again.
- Start the indexing program again, delivering updates to both systems.
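As a rough check for the "wait for replication to happen" step above, the sketch below compares total document counts between a healthy node and the newly attached machine while indexing is still paused. The URLs and collection name are hypothetical, and a matching count is only a coarse signal that the bulk load has finished.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ReplicationCatchUpCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical node URLs; while the indexing process is offline, equal
        // document counts suggest the bulk load onto the repaired machine is done.
        try (HttpSolrClient healthy = new HttpSolrClient.Builder(
                     "http://solr-cn1.example.com:8983/solr/mycollection").build();
             HttpSolrClient repaired = new HttpSolrClient.Builder(
                     "http://solr-us1.example.com:8983/solr/mycollection").build()) {

            SolrQuery all = new SolrQuery("*:*").setRows(0);
            long healthyCount = healthy.query(all).getResults().getNumFound();
            long repairedCount = repaired.query(all).getResults().getNumFound();

            System.out.printf("healthy=%d repaired=%d caughtUp=%b%n",
                    healthyCount, repairedCount, healthyCount == repairedCount);
        }
    }
}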