DFS Replication is really cool. It tends to work really well for us, and it has saved our bacon many times already when used in combination with Namespaces.
This week, one of our servers (Windows Server 2012 R2) went down unexpectedly due to a blue screen error, which turned out to be memory related. As a result, I had a bit of a nightmare on my hands: I had to wait patiently for the RAID to rebuild, and then replication stopped working after that. Luckily, our other replicated server kept things going whilst I worked on the fault, but it seemed that replication had somehow become corrupted and no longer functioned.
I kept seeing events like this one in the event log:
The DFS Replication service stopped replication on the replicated folder at local path W:\DeploymentShare. Additional Information: Error: 9098 (A tombstoned content set deletion has been scheduled) Additional context of the error: Replicated Folder Name: DeploymentShare Replicated Folder ID: 3B11214C-1B97-44D4-B5A3-27563F64007B Replication Group Name: DeploymentShare Replication Group ID: 40EAC201-204F-44E9-95F3-2B1810B4958C Member ID: 486E2ABE-3275-4E4B-9CC7-8C1911EA47E4
and this one also:
The DFS Replication service stopped replication on the replicated folder at local path M:\DeploymentShare. Additional Information: Error: 9073 (Content set initialization is pending journal wrap task to resume journal read) Additional context of the error: Replicated Folder Name: DeploymentShare Replicated Folder ID: 8F880E6D-F6E6-42A5-8E29-7F07B9DC7D73 Replication Group Name: Deployment Share Replication Group ID: 06C43A27-6E9F-408E-AD5B-2A69E69A4F79 Member ID: 76F9B906-3323-450A-A607-6E57B7A4CC1F
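These messages pack the error code and several IDs into one long line, which makes them awkward to compare across replication groups. As a rough aid, here is a minimal Python sketch that splits such a message into its labelled fields. The function name parse_dfsr_event is my own invention, and the sketch assumes the message text looks exactly like the events quoted above; it is not part of any DFS tooling.

```python
import re

# Field labels exactly as they appear in the event text quoted above.
FIELD_LABELS = [
    "Replicated Folder Name",
    "Replicated Folder ID",
    "Replication Group Name",
    "Replication Group ID",
    "Member ID",
]


def parse_dfsr_event(message):
    """Return a dict with the error code, its description and the labelled fields.

    Hypothetical helper: it assumes the single-line layout shown in this post.
    """
    result = {}
    error = re.search(r"Error: (\d+) \((.*?)\)", message)
    if error:
        result["Error"] = int(error.group(1))
        result["Description"] = error.group(2)
    # Each value runs until the next known label or the end of the text.
    labels = "|".join(re.escape(label) for label in FIELD_LABELS)
    pattern = re.compile(rf"({labels}): (.*?)(?=\s+(?:{labels}):|\s*$)", re.S)
    for label, value in pattern.findall(message):
        result[label] = value.strip()
    return result


# The second event from this post, used as sample input.
SAMPLE = (
    "The DFS Replication service stopped replication on the replicated folder "
    r"at local path M:\DeploymentShare. Additional Information: Error: 9073 "
    "(Content set initialization is pending journal wrap task to resume journal read) "
    "Additional context of the error: Replicated Folder Name: DeploymentShare "
    "Replicated Folder ID: 8F880E6D-F6E6-42A5-8E29-7F07B9DC7D73 "
    "Replication Group Name: Deployment Share "
    "Replication Group ID: 06C43A27-6E9F-408E-AD5B-2A69E69A4F79 "
    "Member ID: 76F9B906-3323-450A-A607-6E57B7A4CC1F"
)

if __name__ == "__main__":
    for key, value in parse_dfsr_event(SAMPLE).items():
        print(f"{key}: {value}")
```

Pasting each event into a helper like this makes it easier to spot that the two errors above belong to different replicated folder and replication group IDs.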
I could also see that the DfsrPrivate\Staging folder had a number of files in the Content Set but nothing was replicating. I created a simple text document on Server A and it never arrived at Server B.
In the DFS Management console, I created a diagnostic health report, and it reported the same errors found in the event log.
Here is how I (eventually) resolved this matter without losing any data:
Open a command prompt on both member servers.
On each server, type:
NET STOP DFSR
This will stop the replication service from trying to replicate.
Next, go into Explorer on both servers and show hidden files.
Go into the disks that contain the replicated folders (e.g. the W:\ drive).
Right click on “System Volume Information” and select “Properties” from the context menu.
Go to the “Security” tab and click “Edit…”
We need to give ourselves access to this folder so click on “Add…”
Type in your administrative user name or simply use “Domain Admins” if you choose.
Tick the “Full Control” -> “Allow” check box and click “OK”
Click “OK” again to return to Explorer.
Next, we need to return to our CMD window and type the following:
rmdir "W:\System Volume Information\DFSR" /s
This will remove the DFS Replication database information for this drive. Doing this will force DFS to re-generate a new set.
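The removal can stumble over very long paths, so it may help to know in advance whether any exist under the DFSR folder. The following is a minimal Python sketch for listing them, not something I ran during the incident. The function name find_long_paths is hypothetical, and the sketch assumes Python is available on the server and that the account running it can already read the folder. On older Windows versions, walking into over-long folders may itself fail unless the root is given with the extended-length \\?\ prefix, which this sketch does not handle.

```python
import os

# Classic Windows path limit (MAX_PATH); it includes the terminating NUL,
# so a path of 260 or more characters is already too long for many tools.
MAX_PATH = 260


def find_long_paths(root, limit=MAX_PATH):
    """Yield every file or folder path under root whose length is >= limit.

    Hypothetical helper. os.walk silently skips folders it cannot read,
    so an empty result does not prove that no long paths exist.
    """
    for folder, subfolders, files in os.walk(root):
        for name in subfolders + files:
            full_path = os.path.join(folder, name)
            if len(full_path) >= limit:
                yield full_path


if __name__ == "__main__":
    # The W: drive is the example drive used in this post; adjust as needed.
    for path in find_long_paths(r"W:\System Volume Information\DFSR"):
        print(len(path), path)
```

If this prints nothing, the rmdir command is more likely to complete cleanly; if it lists paths, expect to need the manual deletion described in the note that follows.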
Note: If this command reports any errors about file names being too long, you may need to delete the files manually using a file manager that can handle paths longer than the Windows limit of 260 characters. I used 7-Zip’s File Manager, which is handy for this. In 7-Zip, browse to where the folder is stored and hold SHIFT whilst clicking Delete. The folder should now delete OK.
Once these folders have been removed from both member servers, we can go ahead and start the DFSR service again. In our CMD prompt, type:
NET START DFSR
Watch the event logs! You should see something along these lines within about 10-15 mins:
The DFS Replication service initialized the replicated folder at local path W:\DeploymentShare and is waiting to perform initial replication. The replicated folder will remain in this state until it has received replicated data, directly or indirectly, from the designated primary member. Additional Information: Replicated Folder Name: DeploymentShare Replicated Folder ID: 8F880E6D-F6E6-42A5-8E29-7F07B9DC7D73 Replication Group Name: Deployment Share Replication Group ID: 06C43A27-6E9F-408E-AD5B-2A69E69A4F79 Member ID: 79E1C694-787F-47A5-9566-AE087FC4F7F3
You should also start to see items re-appearing in the DfsrPrivate\Staging folder in those 15 mins.
This worked for one of my replication groups, but another troublesome replication group still hadn’t populated its staging folder after 30 minutes of waiting, so here is what I had to do:
First of all, I checked to see if DFS knew which Member was Primary by typing in the following command in our CMD prompt on one of the member servers:
dfsradmin membership list /RGName:"Distribution Share" /attr:ALL >%USERPROFILE%\Desktop\members.txt
This command kindly dropped a text file on my server desktop, and I noticed that in the “IsPrimary” field, neither of my member servers was Primary. Both were saying “No”.
So then I nominated the surviving member server (not the one that crashed) to be the Primary by typing in the following command:
dfsradmin membership set /RGName:"Distribution Share" /RFName:"Distribution" /MemName:WDS2 /IsPrimary:Yes
Once DFS found a primary member for the Replication Group, it pretty much immediately started filling the Staging area with files and things started moving along nicely. We also got a very promising event logged in the event log:
The DFS Replication service initialized the replicated folder at local path W:\Distribution and is waiting to perform initial replication. The replicated folder will remain in this state until it has received replicated data, directly or indirectly, from the designated primary member. Additional Information: Replicated Folder Name: Distribution Replicated Folder ID: 8C7E9497-452F-4574-9B4F-23759686392A Replication Group Name: Distribution Share Replication Group ID: A064B665-2F5B-4ED6-9C4A-236043697A79 Member ID: 82C9C73E-6882-4E3B-BDA7-A2E09213450E
I could then see a bit later that things were going very well (although I may need to review my staging quota) when this event was logged:
The DFS Replication service has detected that the staging space in use for the replicated folder at local path W:\Distribution is above the high watermark. The service will attempt to delete the oldest staging files. Performance may be affected. Additional Information: Staging Folder: W:\Distribution\DfsrPrivate\Staging\ContentSet{8C7E9497-452F-4574-9B4F-23759686392A}-{82C9C73E-6882-4E3B-BDA7-A2E09213450E} Configured Size: 10240 MB Space in Use: 9216 MB High Watermark: 90% Low Watermark: 60% Replicated Folder Name: Distribution Replicated Folder ID: 8C7E9497-452F-4574-9B4F-23759686392A Replication Group Name: Distribution Share Replication Group ID: A064B665-2F5B-4ED6-9C4A-236043697A79
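The numbers in that event are worth a quick sanity check when reviewing the staging quota. As a small worked example in Python (the function name staging_thresholds is my own, and the 90% and 60% defaults are simply the watermarks reported in the event above), this computes the points at which cleanup starts and stops for a given quota:

```python
def staging_thresholds(quota_mb, high_pct=90, low_pct=60):
    """Return (high_mb, low_mb) for a staging quota given in MB.

    Cleanup of the oldest staging files starts once usage reaches high_mb
    and continues until usage falls to low_mb.
    """
    high_mb = quota_mb * high_pct // 100
    low_mb = quota_mb * low_pct // 100
    return high_mb, low_mb


if __name__ == "__main__":
    # Configured Size from the event above: 10240 MB.
    high, low = staging_thresholds(10240)
    print(high, low)  # prints "9216 6144"
```

With a 10240 MB quota, the high watermark works out to 9216 MB, which is exactly the “Space in Use” figure in the event, so the staging area was sitting right at the cleanup threshold.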
Thankfully, our replication is now back to full operation again. Did this help you? I’d love to hear from you. Please comment below.
Rick