2011/08/08

Replication Agent Failure

I recently spent several days investigating a replication failure. Finding the answer specific to my problem was difficult, so I thought I would post my findings and hope someone else may find it useful.

The day after installing a few security patches, replication to the subscriber database began to fail. The Distribution Agent had the following error message:

Executed as user: <UserAccount>. Replication-Replication Distribution Subsystem: agent <AgentName> failed. Executed as user: <UserAccount>. A required privilege is not held by the client. The step failed. (Error 14151). The step failed.

This message is usually caused by changing the SQL Server service account with the Windows Service Control Manager, which does not grant the permissions required to start the service correctly or to run SQL Agent jobs. SQL Server Configuration Manager should be used instead, and the recommended fix is to set the service account to the Local System account, then back to the domain account, using SQL Server Configuration Manager. (http://support.microsoft.com/kb/911305/)

Since I had not changed the service account recently, I thought this could not be the answer to my problem. But since I was getting the same error message, I figured it would not hurt to restart the service with SQL Server Configuration Manager anyway. Of course, this did not work, but the error message did suggest the problem was security related.

Then I changed all the replication agents to run with the maximum privileges possible. This did not work either. That led me to think the problem was not related to the articles themselves, and maybe not even to replication or SQL Server. I then found the following error message in Event Viewer:

Log Name: System
Source: Security-Kerberos
Event ID: 4

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server <servername>$. The target name used was <servername>$. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (<domainname.com>) is different from the client domain (<domainname.com>), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

This error message occurs when two or more computer accounts have the same SPN registered.
(http://support.microsoft.com/kb/321044)
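
One way to check for this condition, assuming the setspn utility is available (the -X switch requires the Windows Server 2008 version), is to search the domain for duplicate SPNs and list what is registered for the machine account:

setspn -X
setspn -L <servername>$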

To fix this, we deleted the computer account entries in Active Directory, then removed the server from the domain and rejoined it. Except for a few expired articles, all replication articles synchronized on their own without intervention. I reinitialized the expired articles.
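
For reference, an expired article can also be reinitialized with sp_reinitsubscription, run in the publication database. A minimal sketch, with hypothetical names:

EXEC sp_reinitsubscription
    @publication = N'MyPublication' ,
    @article = N'MyArticle' ,
    @subscriber = N'MySubscriberServer' ;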

This has fixed our problem so far, but if anyone has experience with this issue, your feedback is welcome.

 

Regards,

Jon

2011/05/20

Managing Data Change

Often DBAs receive requests to manually manipulate production data. Ideally, all data changes should be performed through the appropriate application or interface. Unfortunately, applications sometimes lack a feature or have a bug that prevents the user from modifying data. Fixing a bug or adding a feature may require another software release, and sometimes the business need is too great to wait for the next release.

Manually manipulating data can pose various problems, both for the DBA who designed the database and knows everything about the schema and for the DBA who recently inherited a database and knows nothing about it. Some of these problems are:

  1. Data cannot simply be changed whenever we want, to any value we want. Most tables are related to other tables. Some data is designed to represent a record’s state at a certain point in time. Changing a value may break that relationship or state and, therefore, break the application and corrupt data.
  2. In most cases, with a well designed and documented database that has all the appropriate keys and constraints, a DBA can quickly determine whether modifying data will break the application. In the real world, this is often not the case. The schema does not necessarily reveal the application logic, or how the data is used or modified. Sometimes only the original developer or an experienced analyst knows the application well enough.
  3. Changing data manually is labor intensive, and DBA and developer labor is not cheap. Because it is labor intensive, continuing to manipulate data manually will, at some point, become more expensive than coding the application appropriately.
  4. Changing data manually is error prone - Despite their appearance, DBAs are human. Even a highly skilled DBA with years of experience can write an UPDATE statement incorrectly. Also, the DBA can be given the wrong record to update.

If requests for data modification are frequent, policy and procedures to manage data change are needed. These policies and procedures should have the following objectives:

  1. Retain old data - Retain the data as it existed prior to the change so changes can be tracked and reversed, if needed (see the sketch after this list).
  2. Document Purpose - Document the purpose of the data change in order to build a case for a software update. By doing this you may find that you are updating the same column on the same table several times a month; now you will have the documentation to prove it and explain why.
  3. Notify Interested Parties - There may be other interested parties in your company that need to be aware of the change. Operations managers may not want you changing records during the day or after an order is shipped. Software Development and Product managers may want to know about a missing feature or bug in the software.
  4. Only certain data can be changed - Because the database does not necessarily reveal the application logic and table relationships, only approved tables and columns should be changed.
  5. Test all changes - All changes should first be done in a development environment.
  6. Verify changes - All changes are verified in a QA or testing environment.
  7. Use the best person to develop an update script - The DBA may not be the best person to develop an update script. Sure, he or she may be the most skilled, but as I mentioned above, the DBA may not completely understand the application logic or business implications. In such cases the developer, perhaps with the help of a business analyst, can develop an appropriate update script.
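
To illustrate the first objective, here is a minimal T-SQL sketch of retaining old data while performing a manual update. The dbo.Orders table, the audit table, and the values are all hypothetical:

-- A hypothetical audit table to retain prior values.
CREATE TABLE dbo.DataChangeAudit
(
    OrderID   INT ,
    OldStatus VARCHAR(20) ,
    ChangedBy SYSNAME ,
    ChangedAt DATETIME
);

BEGIN TRANSACTION;

-- Capture the old value while making the change.
UPDATE dbo.Orders
SET    Status = 'Cancelled'
OUTPUT Deleted.OrderID , Deleted.Status , SYSTEM_USER , GETDATE()
INTO   dbo.DataChangeAudit ( OrderID , OldStatus , ChangedBy , ChangedAt )
WHERE  OrderID = 12345;

-- Verify exactly one row was affected before committing.
IF @@ROWCOUNT = 1
    COMMIT TRANSACTION;
ELSE
    ROLLBACK TRANSACTION;

Wrapping the update in a transaction and checking @@ROWCOUNT also guards against the wrong-record problem mentioned earlier.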

How does your company manage data change requests? Besides my list above, what other problems do you see with manually manipulating data? Can you think of other objectives a data change request policy should have?

2011/05/02

Poor Man’s SQL Monitor

Here is a quick tip that can help you track SQL Job and Alert notifications in Microsoft Outlook. You may already do this, but if you arrange your inbox by conversation instead of date, emails will be grouped by subject. This enables you to easily “arrow down” through the group and quickly compare messages.

Look at a standard SQL Server Alert notification as an example:

Subject: SQL Server Alert System: 'Batch Requests/sec'
DATE/TIME: 5/2/2011 5:35:01 AM
DESCRIPTION: The SQL Server performance counter 'Batch Requests/sec' (instance 'N/A') of object 'SQLServer:SQL Statistics' is now above the threshold of 1000.00 (the current value is 1077.30).

By sorting through several of these “Batch Requests/sec“ messages you can quickly compare important information such as the date/time and the current Batch Requests/sec value.

In my particular case, I received this email exactly every 30 minutes. I may have found this pattern anyway, but by arranging my inbox by conversation it was quick and easy to spot. The exact time interval suggested some type of scheduled job or process, which is exactly what the problem turned out to be.
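
Incidentally, an alert like the one above can be defined with sp_add_alert. A minimal sketch, where the counter and threshold mirror the message and the alert name is arbitrary:

EXEC msdb.dbo.sp_add_alert
    @name = N'Batch Requests/sec above 1000' ,
    @performance_condition = N'SQLServer:SQL Statistics|Batch Requests/sec||>|1000' ;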

2011/04/29

Managing your Backup Schedules

Have you ever noticed your backup job duration growing faster than your database? This recently happened to me, so I thought I would share it.

Not managing your backup schedules can cause you several problems later on. This is especially the case when you do not have any centralized backup management software, such as Red Gate's SQL Backup. Managing backups is fundamental for a database administrator, but I believe revisiting the basics is good practice, especially if you are a new DBA.

For the past several months, I noticed the duration of several backups growing by up to 50%. At first I assumed that since most of these databases are growing, it made sense the backups would take longer too, so I did not do anything about it. Eventually, I noticed most of these databases had grown only about 15% over the same period, which did not add up. I then looked at the backup schedules on the different servers and noticed that some of them started at the same time, and some were not completing before the next one started. These jobs were competing for SQL Server resources, primarily disk contention on the local backup drive as well as on the network drive where the backup file was later copied. There was also a developer database restore job running at the same time as other backup jobs, causing disk contention on the network drive.

Of course, the answer is to just schedule the backups appropriately, but there is a lesson here. Here are a few things I do to keep my backup schedules from tripping on each other:
  • Periodically review your backup schedules. If your backup jobs use the same drive, local or network, plan your schedules to work with all jobs on all servers.
  • Consider other jobs or processes that will be accessing the same drive.
  • Monitor your database growth. Your backups will grow too.
  • Monitor your backup duration and adjust your schedules accordingly (see the query after this list).
  • Use a spreadsheet to list all your schedules for all servers to see if any overlap. List Server Name, Process/Job Name, Start Time, Average Duration, and Average End Time.
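
For the last two points, a quick way to see backup durations and sizes without extra tooling is to query msdb directly. A minimal sketch; adjust the date window and backup type ('D' = full) to taste:

SELECT bs.database_name ,
       bs.backup_start_date ,
       bs.backup_finish_date ,
       DATEDIFF(MINUTE, bs.backup_start_date, bs.backup_finish_date) AS DurationMin ,
       bs.backup_size / 1048576.0 AS SizeMB
FROM   msdb.dbo.backupset bs
WHERE  bs.type = 'D'
       AND bs.backup_start_date >= DATEADD(DAY, -30, GETDATE())
ORDER BY bs.database_name , bs.backup_start_date ;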
Again, this is fairly basic, but I have been a SQL Server Database Administrator for over 9 years and this one got me.

2010/04/16

SQL Server Agent Jobs Not Completing

Problem
Random jobs and SSIS packages started failing, generating the following error message: “DTExec.exe - Application Error : The application failed to initialize properly (0xc0000142). Click on OK to terminate the application.”  
 
Symptoms
SQL Server 2005 maintenance jobs run continuously. A job starts but never completes. This occurs on all jobs that run an SSIS package; jobs that only run T-SQL run fine. The only message found is an informational Windows system log entry that appears every time the job runs:
Event Source: Application Popup
Event ID: 26
Description: Application popup: DTExec.exe - Application Error : The application failed to initialize properly (0xc0000142). Click on OK to terminate the application.

Workaround
This problem went away after I consolidated many SQL Server Replication Jobs. I am not sure why, but I suspect it had something to do with the non-interactive desktop heap (KB 824422).

After searching the Internet for some time, I came across MS KB 824422 (http://support.microsoft.com/kb/824422). This was close to my problem, but not exactly it. The KB did make me think about how many processes were running concurrently. The following week I was getting several replication job errors. We had a couple hundred SQL Server replication jobs running continuously. Consolidating the replication jobs was a project I had tasked myself with a month earlier, so I decided to start it and see what happened. Sure enough, the errors went away. Keep in mind, I did NOT make any registry changes as the KB suggested; I just reduced the number of concurrent SQL Jobs.
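
If you want a rough picture of how many replication jobs are defined, a query along these lines against msdb can help. It assumes the default REPL-% job categories:

SELECT c.name AS CategoryName ,
       COUNT(*) AS JobCnt
FROM   msdb.dbo.sysjobs j
       INNER JOIN msdb.dbo.syscategories c
           ON j.category_id = c.category_id
WHERE  c.name LIKE 'REPL-%'
       AND j.enabled = 1
GROUP BY c.name ;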

I call this a workaround because I cannot explain exactly how it fixed the problem. Here is the script I use to help me consolidate replication articles into fewer publications.

Read my forum post on SQL Server Central. If anyone has experience with this, I would appreciate your feedback.

2010/04/06

Query Execution Performance

This is a useful script for testing query execution time and viewing IO statistics, which is especially helpful for developers creating efficient queries. The utility simply measures the execution time of your query and returns IO statistics, such as scan count, logical reads, and more. Just paste in your query and execute.

Download the file here.
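
If the download link no longer works, the same idea can be sketched with SET STATISTICS TIME and SET STATISTICS IO. The SELECT below is just a hypothetical placeholder for your own query:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Paste your query here; this one is a placeholder.
SELECT TOP (100) *
FROM   dbo.Orders
WHERE  OrderDate >= '2011-01-01';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;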

I recommend that developers closely analyze not just the execution duration, but also the IO statistics and the execution plan.

To learn more about these topics, look into SET STATISTICS TIME, SET STATISTICS IO, and reading execution plans.

Replication Publication Articles

This is a useful script when planning where to place a new publication. I like grouping common articles into one publication to make for easier administration. I also like knowing the row count for those articles.

-- Find publications and articles matching a name pattern, with row counts.
DECLARE @SearchStr NVARCHAR(100);
SET @SearchStr = '%Orders%';

SELECT p.name AS PublicationName ,
       a.name AS ArticleName ,
       t.row_count AS RowCnt
FROM   dbo.syspublications p
       INNER JOIN dbo.sysarticles a
           ON p.pubid = a.pubid
       LEFT JOIN sys.dm_db_partition_stats t
           ON a.objid = t.object_id
WHERE  t.index_id < 2 -- heap or clustered index rows only
       AND p.name LIKE @SearchStr
       AND a.name LIKE @SearchStr
ORDER BY p.name , a.name ;

Download the file here.

I developed this simple script to help me consolidate replication articles into fewer publications. In this case, there was one publication per article, and there were a lot of articles (yikes). All publications were push subscriptions, which made the production database do all the work.