Oracle Sean

View Original

How not to find Data Guard gaps

I've recently had several encounters with a query for checking for archive log gaps. It appears on multiple blogs, and I suspect it may originate from MOS note 1542907.1, "Standby MRP is looking for very old archive log sequence even after restore new incremental backup at standby site." The query in question:

See this content in the original post

This may sometimes show correct results, but isn’t a reliable way of getting gaps from a standby. To understand why its output is misleading and sometimes wrong, let's break down what it does.

The first problem is it will report different results on the primary and standby, depending on the number of configured destinations.

Primary with three log_archive_dest set:

See this content in the original post

Standby with two log_archive_dest:

See this content in the original post

Standby with one log_archive_dest:

See this content in the original post

Having multiple destinations isn't unusual, so seeing multiple rows in the output suggests the authors didn’t anticipate more complex environments, didn't understand the data, or didn't test this across different environments. Each line in the output represents a different log destination. Do two rows mean that there are two standby databases, and both are in sync? No!

I have another issue with this. Working on critical issues should be as foolproof as possible. A query that produces different results on different hosts means people who may already be dealing with cognitive overload (phone calls, chat and text messages, emails, alerts from monitoring systems, listening to and answering questions on a bridge, people camped out in their cubicle) must remember this query has to run on a particular system. The different results further complicate things. In these situations, I prefer something that works reliably and provides identical results no matter where it's executed.

Pet peeve: messy code

The query is based on two, similar subqueries against v$archived_log and v$log_history:

See this content in the original post

The output of the ARCH subquery (from v$archived_log) represents the "Last Sequence Received".
The output of the APPL subquery (from v$log_history) represents the "Last Sequence Applied".

Both look at the maximum value of FIRST_TIME to limit the result to a single sequence per thread. But… FIRST_TIME is the timestamp of the first redo change (SCN) and corresponds to the SCN in FIRST_CHANGE#. The problem? This column has nothing to do with redo transport or apply! A single thread/sequence can have multiple rows in v$archived_log—one for each log destination—but the FIRST_TIME is always the same for all, regardless of whether it's been shipped or applied on a standby!

Since FIRST_TIME doesn't differentiate records, the outermost select against the subquery is redundant. They can be simplified to:

See this content in the original post

I'm a code minimalist. While the results are identical, code should be as simple as possible, especially when used during critical events—like disaster recovery. The extra code adds noise and makes it harder to read (and understand).

OK, so duplicate rows and some extra SQL aren't the end of the world. Let's look at the far more significant problem: wrong results!

There was a gap, but there was no gap

I mentioned that this query appears in MOS and multiple blog posts. I haven't looked at all of them, but I haven’t seen any guidance regarding where to run this. Logically, it should run on the standby because it uses v$log_history, which tells us nothing about logs on the standby (more on why later.) However, running it on a standby guarantees situations where it will not report a gap, even when there is a gap.!

In this case, the query showed no gap. DBAs used output from v$managed_standby to further “confirm” that log shipping working:

See this content in the original post

If you aren’t familiar with the output of v$managed_standby, this might suggest the system is healthy. Yet things were very, very Not Good. Redo wasn't shipping to the standby, thanks to a bad connection string in the Data Guard Broker configuration.

Why did the query show a gap of 0? It has no information to the contrary, and will thus never report a gap in this scenario, thanks to the nature of data in these views.

The "Last Received Sequence" check

Following the event, I selected additional columns from v$archived_log for three sequences:

  • 2093, shown in the v$managed_standby output above, with a status of CLOSING;
  • 2158, the sequence reported as current, with no gap;
  • 2272, the (then) latest log sequence.
See this content in the original post

Results on the primary:

See this content in the original post

And, on the standby:

See this content in the original post

Notice the following:

  • These results confirm that FIRST_TIME is identical for every thread/sequence on both systems and doesn't vary based on whether logs were shipped or applied.
  • On the primary, each sequence has three records—one for each log destination. Destinations 1 and 3 are local. Destination 2 is the standby.
  • On the standby, there are two records per sequence. Destination 1 is FRA. Destination 3 is a local directory.
  • The COMPLETION_TIMEs at the primary for sequence 2093 are identical at the local destinations, but delayed by ~17.5 hours at the standby.
  • The COMPLETION_TIMEs at the standby do not match the primary.
  • The FAL column on the primary—indicating whether the standby issued a FAL (Fetch Archive Log) request to the primary—is YES for the standby destinations of sequences 2093 and 2158.
    • FAL requests originate at the standby database and indicate that the log wasn't available locally (eg, hadn't been shipped as expected). A FAL request asks the primary (or other upstream participant in a Data Guard configuration) to re-send the missing file.
    • Seeing FAL requests for sequences 2093 and 2158 means those logs were missing on the standby.
    • Once the configuration was corrected and the primary and standby resumed communication, the standby identified the missing sequences and requested them from the primary.
  • Look closely at the RECID values. These reflect the order that log records were created.
    • They're sequential on the standby, but on the primary, the RECID for the standby destinations are significantly delayed for sequence 2093.
    • The record for shipping sequence 2093 was also created after the record for sequence 2158.
    • The standby destination COMPLETION_TIME for sequence 2158 is also earlier than the completion time for sequence 2093—consistent with the RECID values.
  • The completion time and RECID of the standby destination row for sequence 2272—created after the configuration was corrected—are earlier than the times for the local destinations, and earlier still on the standby than the primary.

Based on the contents of the view on the primary, determining whether a log was received via v$archived_log would require—at minimum—confining results to rows with STANDBY=YES and APPLIED=NO.

Determining whether a sequence was received but not applied on a standby is a bit more difficult. We could filter on APPLIED=NO, but that doesn't consider the possibility of multiple destinations. We might look to v$archive_dest, but in this case, the only thing different about destinations 1 and 3 are the values of DESTINATION. One is set to USE_DB_RECOVERY_FILE_DEST, while the other is a directory. It's dangerous to assume standby apply always runs on DEST_ID=1 or DESTINATION=USE_DB_RECOVERY_FILE_DEST, and this is precisely the sort of thing that leads to flawed conclusions during a crisis!

The "Last Applied Sequence" check

Recall that the parent query gets its "Last Sequence Applied" value from v$log_history:

See this content in the original post

Here the output for the three sequences:

See this content in the original post

This view has a single entry for each thread/sequence because it reflects only that a sequence exists—the history. It has no detail and nothing identifying whether a log was applied! The value of the "Last Applied Sequence" (on the standby) will always be the latest received sequence.

That’s why this query is probably intended to run on the standby, and also why it should never be run on the standby. The sequence in v$log_history at the primary is definitive. v$log_history on the standby only knows what it’s received.

But if log shipping isn't working, the sequence in the standby’s v$log_history is the last sequence it registered. It can't know about anything more recent, and if it isn't communicating with the primary, the last sequences in v$archived_log and v$log_history will always match, and the gap will always be 0!

(Note that the v$log_history view also can’t tell us if a sequence is applied, only that it was received and registered with the database. Using this view to get the “Last Applied Sequence” is flawed. That could only, perhaps, come from v$archived_log.)

That's why gap checks shouldn’t run from standby databases. If contact is broken, these views are still populated, but not current. Processes will probably still be running, and there may be no outward indication of a problem.

Do this instead

v$archived_log is not an authoritative source for identifying gaps in Data Guard. If the primary can't reach one or more standby databases, or if there are errors in transport or apply, the presence of a sequence doesn’t mean data is moving.

We might be able to refine the original query, adding checks against the APPLIED and STANDBY columns, but consider other factors that complicate attempts to find gaps in Data Guard:

  • The protection mode
  • The sync mode
  • Multiple, alternate, and deferred destinations
  • Destination priorities
  • The log_archive_min_succeed_dest setting
  • The archive_lag_target setting
  • FarSync
  • Cascading standbys

It’s a lot to consider. Writing a query to report accurate gaps from Data Guard—under every situation—is difficult. Fortunately, we don't need to write anything. Oracle built a check into the Data Guard Broker, and there’s a dedicated view that reports gaps, as well as shipping and apply progress:

See this content in the original post

The Broker provides identical results on any member of a Data Guard configuration. If something is wrong, it doesn’t work or returns errors.

The v$archived_dest_status view has a column that reports the gap status, archived and apply sequences, and other information:

See this content in the original post

This view can be joined with v$archive_dest, v$archived_log, v$thread, v$database, v$instance, and others (or any of their gv$ counterparts) to paint a more complete picture of Data Guard activity. However, this only works from the primary. When run on the standby, the GAP_STATUS is NULL.

What about v$managed_standby?

If you've made it this far, congratulations! I want to revisit one last piece of this puzzle that added confusion for the DBAs working on this issue.

Remember that query against v$managed_standby? It seemed to confirm that Data Guard was working because the SEQUENCE# of the MRP process was advancing:

See this content in the original post

Note the following:

  • All but two processes are attached to (the nonexistent) thread 0.
  • The MRP0 (Managed Standby Recovery Process) process reads redo and applies it on the standby, and the values are increasing.
  • However the ARCH process didn't advance—both results showed the same sequence, 2093, in a CLOSING status.
  • There is only one ARCH process for thread 1. We expect to see one per log group.
  • The RFS (Remote File Server) processes on the standby receive incoming redo and write to the standby redo logs. Both are idle and assigned to thread 0.

This reveals the danger of ruling out a problem by using something that doesn’t clearly state a condition or error, or that you aren’t familiar enough with to (instantly) separate normal from abnormal.

After correcting the configuration, the output of the healthy system has multiple ARCH and RFS processes:

See this content in the original post

After a log switch, the ARCH process starts writing the new, incoming log. RFS begins real-time apply for the new log. MRP starts applying the new log sequence:

See this content in the original post

v$archived_log shows apply running IN-MEMORY when the database is in Real-Time Apply mode (these results are for slightly different sequences—sorry):

See this content in the original post

The newly archived log switches from WRITING to CLOSING, while RFS changes to RECEIVING:

See this content in the original post

RFS returns to an IDLE state as it waits for the next log, and MRP continues the IN-MEMORY log apply:

See this content in the original post

I don't consider v$managed_standby a good source for confirming Data Guard is shipping and applying logs. There are no clear errors or exceptions; during a critical event, the output takes more time to review and process. I prefer error messages, binary conditions, or non-zero values that unequivocally indicate problems. (It is excellent for investigating issues with Data Guard processes, though.) However, if you aren't seeing sequences advancing for the ARCH processes, or if there's a gap between the sequences connected to the MRP and ARCH processes, there's a good chance that log shipping or redo apply isn't healthy!