Book Review: Microsoft System Center – Operations Manager Field Experience

Recently, I finished reading the Microsoft System Center Operations Manager Field Experience eBook.

The chapter(s) that I found most helpful were basically all of them! The entire book is filled with very useful points, tips, and insights.

I’ve decided to share my highlights from reading this specific publication, in case the points that I found of note/interest will be of some benefit to someone else. So, here are my highlights (by chapter). Note that not every chapter will have highlights (depending on the content and the main focus of my work).

 

Chapter 01: The Role of Operations Manager

  • By default, every installation of Operations Manager is not registered; it’s installed as an evaluation version. This is true even if you installed it from volume licensing media. To register your installed environment, you can use the Windows PowerShell cmdlet Set-SCOMLicense
  • Since Operations Manager was designed with built-in high availability when you have two management servers, having two or more management servers is recommended. That way, if one goes down, failover is possible. To determine which management server is down and which is still up and running, the server running the Operational Database serves as a watcher node, similar to a witness in a failover cluster, and has a majority in deciding which one is the functional management server.
  • If the primary management server for the agent goes down, the agent tries to connect to one of the management servers defined as a failover. You can define a failover management server through the console by using AD integration or by using the Set-SCOMParentManagementServer cmdlet with the -FailoverServer parameter.
  • Azure Operational Insights provides you with the combined knowledge of the Microsoft Support Engineers, who are responsible for adding rules to the product. These rules work like an additional management pack that is managed centrally by Microsoft
  • By default, on Windows Server 2008 R2 and higher, power management is set to Balanced. In some cases, you may experience degraded overall performance on a Windows Server machine when running with the default power plan. This is most noticeable on the SQL server running the Operational Database, where the Balanced power setting results in slow console performance since most of the background actions in the console are SQL query-based. The issue may occur irrespective of platform and may be exhibited in both physical and virtual environments
  • Another power management setting to consider is described in the Knowledge Base article “Degraded overall performance on Windows Server 2008 R2” at http://support.microsoft.com/kb/2207548. Note that even though this article describes the problem in the context of Windows Server 2008 R2, the strategies described are also valid for later versions of Windows Server
  • You can find some important information about the power management setting on a network adapter at http://support.microsoft.com/kb/2740020. As stated in the Knowledge Base article, you might want to disable the Allow The Computer To Turn Off This Device To Save Power network adapter power management setting on servers
  • The D drive on an Azure IaaS VM is a temporary disk, using local storage from the actual hardware that is hosting your VM. This means that everything on this drive will be lost in the case of a reboot, so don’t use it to store anything that you want to keep.
  • Find more information about using SSDs in Azure VMs to store the SQL Server TempDB and buffer pool extensions at http://blogs.technet.com/b/dataplatforminsider/archive/2014/09/25/using-ssds-in-azure-vms-to-store-sql-server-tempdb-and-buffer-pool-extensions.aspx
  • The general information found in this blog post also applies to SQL Server: http://blogs.msdn.com/b/mast/archive/2014/10/14/configuring-azure-virtual-machines-for-optimal-storage-performance.aspx.
  • To test the speed of your disk subsystem, use the SQLIO Disk Subsystem Benchmark Tool from Microsoft, available at http://www.microsoft.com/en-us/download/details.aspx?id=20163.
  • Putting a Gateway server in a remote subnet to compress the outgoing data is no longer recommended. The agent itself does an equally good job of compressing the data in Operations Manager 2012 R2. However, the other reasons for installing a Gateway server in a remote subnet are still valid, for instance to reduce the administrative overhead and to minimize the number of certificates that are needed. More information can be found at http://technet.microsoft.com/en-us/library/hh212823.aspx.
  • When you install Operations Manager on machines running antivirus software, you should configure the antivirus software so that the following directories are excluded:
    • The Health Service State folder on every management server and every agent
    • The data and log file directories where your databases are located
    • Excluding the actual binary files, such as MonitoringHost.exe, is not recommended.
  • A detailed overview of antivirus exclusions for Operations Manager can be found at http://support.microsoft.com/kb/975931.
  • The best way to configure SQL Server in your Operations Manager environment is to keep it simple. The default settings for Operations Manager should be left alone unless you have very specific reasons to change them
  • Neither auto grow nor auto shrink is recommended for the Operational Database because it needs 50 percent free space at all times to perform maintenance and indexing tasks. If the database doesn’t have enough free space, the scheduled maintenance tasks might fail. Operations Manager will alert you when there is less than 40 percent free space.
  • The SQL Server edition you are using also has an important role when you are considering auto grow. SQL Server Standard edition can cause the database tables to lock out when auto grow is configured. However, this does not occur with SQL Server Enterprise edition. This applies to both the Operational Database and the Data Warehouse Database.
  • Auto grow is supported (though not recommended) when enabled as an insurance policy against the database files filling up. When using auto grow on the databases, it is better to set it to increase by a fixed amount rather than a percentage. The fixed increase amount should be no more than 500 MB or 1 GB in growth to limit the blocking that might occur during the expansion process. It is also useful to configure a maximum possible size to prevent the databases from filling up the disk they reside on.
  • More information about auto grow and auto shrink can be found in the article at http://support.microsoft.com/kb/315512/.
  • Find more information about instant file initialization at http://blogs.msdn.com/b/sql_pfe_blog/archive/2009/12/23/how-and-why-to-enable-instant-file-initialization.aspx and http://sqlblog.com/blogs/tibor_karaszi/archive/2009/03/09/do-you-have-instant-file-initialization.aspx
  • In SQL Server, data files can be initialized instantaneously. This allows for fast running of file operations. Instant file initialization reclaims used disk space without filling that space with zeros. Instead, disk content is overwritten as new data is written to the files. Log files cannot be initialized instantaneously. Instant file initialization is available only if the SQL Server (MSSQLSERVER) service account has been granted the right to perform volume maintenance tasks (SE_MANAGE_VOLUME_NAME). Members of the Windows Administrators group have this right and can grant it to other users by adding them to the Perform Volume Maintenance Tasks security policy.
  • As a general rule, set the combined SQL Server maximum server memory value over all the instances to about 2 GB less than the actual memory available on the host. This will secure enough available memory for the operating system to function optimally.
  • Another low-effort, high-reward action is splitting up the files that comprise the TempDB. There’s only one TempDB per SQL Server instance, so it’s often a performance bottleneck. Make sure that the disk subsystem that holds the TempDB files is up to the task. Increase the number of data files that make up your TempDB to maximize disk bandwidth and to reduce contention in allocation structures
  • Generally, if the number of logical processors is less than or equal to eight, use the same number of data files as logical processors. If the number of logical processors is greater than eight, use eight data files; if contention continues, increase the number of data files by multiples of four (up to the number of logical processors) until the contention is reduced to acceptable levels or make changes to the workload/code. It is also best to spread these different files over multiple disk systems and to keep all files the same size.
  • The log file for TempDB should remain a single file at all times.
  • It is also recommended that you size the TempDB according to the Operations Manager environment. The default size for TempDB is 8 MB with a 1-MB log file. Every time you restart SQL Server, it will re-create this 8-MB file from the model database.
  • TempDB optimization is explained in detail at http://technet.microsoft.com/en-us/library/ms175527(v=sql.105).aspx.
  • For SQL servers running in Azure IaaS, please refer to http://msdn.microsoft.com/library/azure/dn248436.aspx and http://msdn.microsoft.com/en-us/library/azure/dn133149.aspx.
  • Some SQL teams automatically assume that all databases should be set to Full recovery model. This requires backing up the transaction logs on a regular basis, but gives the added advantage of restoring up to the time of the last transaction log backup. This approach does not make as much sense for Operations Manager
  • The default and recommended authentication setting for SQL Server in an Operations Manager environment is Windows authentication, meaning that SQL Server uses Windows logins.
  • It is best practice to use a domain account to run your SQL Server service (MSSQLSvc). The problem with this is that if your SQL Server service is not running as either the server’s system account or a domain administrator, SQL Server cannot register its Service SPN when the service is started. If the SQL Server service does not have sufficient rights, you can use the SETSPN tool manually as a domain administrator to register the necessary SPNs.
  • More information about SPNs in Operations Manager can be found at http://blogs.technet.com/b/kevinholman/archive/2011/08/08/opsmgr-2012-what-should-the-spn-s-look-like.aspx.
  • By default, Operations Manager does self-maintenance. Since most Operations Manager administrators are not SQL Database Administrators (DBAs), Microsoft implemented several rules in Operations Manager to automatically keep the databases optimized. These maintenance tasks are defined as system rules in the Operations Manager management pack, one of the management packs installed by default when you install Operations Manager. Since these maintenance tasks run automatically, be careful that your own maintenance tasks do not conflict with the built-in system rules (if you or the DBA decide to implement additional maintenance).
  • For the Operations Manager Data Warehouse, an automatic maintenance job runs every 60 seconds. This job, coming from the Standard Data Warehouse Data Set maintenance rule, does many things, of which re-indexing is only one. All the necessary tables are updated and re-indexed as needed. When a table is 10 percent fragmented, the job re-organizes it. When the table is 30 percent or more fragmented, the index is re-built. Therefore, especially since the built-in maintenance runs every 60 seconds, there is no need for a DBA to run any UPDATE STATISTICS or DBCC DBREINDEX maintenance commands against this database.
  • By default, the block size of any disk less than 16 TB is 4 K. Since SQL Server reads in 64-K increments, it is best practice to format the disk containing the SQL data and log files with 64-K block size. You can only set this allocation unit size when you format the disk.
  • If you use the wrong collation, searches may be less effective or not work at all, sorting might produce unexpected results, and other problems can happen when inserting or retrieving data.
  • If a SQL Server collation other than SQL_Latin1_General_CP1_CI_AS is specified when you create the database, you will have to reinstall Operations Manager and create another database to fix this problem because you cannot change the collation after installing Operations Manager.
  • The registry key path where settings for the Data Access Layer are included is:
    • HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL
    • The DWORD setting, called DALInitiateClearPool, is used by the Data Access Service to control whether to reconnect to the database after a period of unavailability. The default value is 0 (disabled). The recommendation is to enable this feature by setting the value to 1 (decimal).
  • The Persistence Manager feature is used by the Health Service to read and write data to the local database. The local or cache database is called HealthServiceStore.edb, and it is a Microsoft Jet Database Engine database. The registry key path for settings belonging to this feature is:
    • HKLM\SYSTEM\CurrentControlSet\Services\HealthService\Parameters
    • The setting responsible for how often Persistence Manager writes data from memory to the disk is called Persistence Checkpoint Depth Maximum of type DWORD and is measured in bytes. The default value for this setting is 20971520 (decimal) bytes. On management servers that handle a large number of objects not managed directly by agents, such as SNMP Devices, Groups, URL Monitors, Cross-Platform Agents, and so on, you may need to increase this value to relieve disk pressure. The recommended value is 104857600 (decimal).
  • Health Manager is used by the Health Service to calculate and track the health state of each monitor of each object it monitors. The registry path for settings belonging to this feature is:
    • HKLM\SYSTEM\CurrentControlSet\Services\HealthService\Parameters
    • The important setting for the Health Manager is State Queue Items of type DWORD. This sets the maximum size (in bytes) of the state data queue. If the value is too small or if there are too many workflows running (based on the number of objects being managed), there could be possible state change data loss. The default value for this setting is calculated by the Health Service on startup based on how many objects it needs to manage. For agents in a small environment, this value is set to 1024 (decimal). The value is set to 10240 (decimal) on management servers in a mid-size environment. For large environments, on management servers that manage many objects, the default is 25600 (decimal). The recommendation is to double these default values, depending on where it is needed—for an agent that manages a lot of objects or a management server.
  • Do not change the settings for Pool Manager unless advised by Microsoft Support after a proper analysis of the environment, behavior of the resource pools, and load on the management servers. If these settings are changed, it is important to make sure that they are changed to the same value on all management servers in the environment.
  • To remove a server from the resource pools with automatic membership, first set the group membership to manual (automatic is the default). This can be done only from within Windows PowerShell as follows:
    • Get-SCOMResourcePool -DisplayName "<Resource Pool Name>" | Set-SCOMResourcePool -EnableAutomaticMembership 0
    • After you run this command, you can then use either Windows PowerShell or the console to remove the management server.
  • Daily tasks
    • Use the imported management packs (general views, Management Group Health dashboard view and reports) to verify that the Operations Manager features are healthy.
    • Check that alerts from the previous day are not still in a state of New. Check the repeat counts and date created for your alerts.
    • Check for any unusual alert or event noise; investigate further if required (for example, failing scripts, WMI issues, grey agents, and so on).
    • Check the status of all agents for any state other than green. Verify that all managed computers are communicating.
    • Review nightly backup jobs and database space allocation.
    • Verify that predefined maintenance tasks scheduled to run daily are running successfully.
    • Check the Operations Manager event logs on each management server for unusual behavior and error events.
  • Weekly tasks
    • Schedule weekly meetings with operational application owners to review previous most common alerts and events.
    • Use the top-down approach to running the Most Common Alerts and Most Common Events reports. Investigate further where necessary.
    • Run the Data Volume by Management Pack and Data Volume by Workflow and Instance reports.
    • Check available disk space on all Operations Manager database systems (data and log files).
  • Monthly tasks
    • Check for new management pack versions of any installed management packs. Also check the management pack guides of newly released management packs to determine whether they meet the requirements of your organization and are suitable for your environment.
    • Review the baselines (performance counters) to assess the ongoing performance of the Operations Manager environment as new agents and management packs are added.
    • Review the disaster recovery plan for any needed changes.
  • Microsoft’s Brian Wren has put together a very extensive authoring guide, which can be accessed at http://social.technet.microsoft.com/wiki/contents/articles/15251.system-center-management-pack-authoring-guide.aspx.
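
To make a few of the SQL Server rules of thumb from this chapter concrete, here is a minimal Python sketch. The function names are my own (hypothetical) and are not part of Operations Manager or SQL Server; only the thresholds (2 GB reserve, eight TempDB data files, steps of four, 50 and 40 percent free space) come from the highlights above.

```python
def max_server_memory_gb(host_memory_gb, reserve_gb=2.0):
    """Combined max server memory across all instances: host memory minus about 2 GB."""
    return max(host_memory_gb - reserve_gb, 0.0)


def tempdb_data_file_count(logical_processors):
    """Starting number of TempDB data files: one per logical processor, capped at eight."""
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return min(logical_processors, 8)


def next_tempdb_data_file_count(current_files, logical_processors):
    """If contention continues, add four files, never exceeding the logical processor count."""
    return min(current_files + 4, logical_processors)


def operational_db_free_space_status(size_mb, used_mb):
    """Compare free space with the 50 percent guideline and the 40 percent alert level."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    free_pct = 100.0 * (size_mb - used_mb) / size_mb
    if free_pct < 40:
        return "alert"              # Operations Manager would raise an alert here
    if free_pct < 50:
        return "below recommended"  # maintenance tasks may start to struggle
    return "ok"
```

For example, a host with 16 logical processors would start with tempdb_data_file_count(16) data files and, only if contention persists, move up via next_tempdb_data_file_count. Treat the results as starting points for discussion with your DBA rather than hard limits.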

 

Chapter 02: Best Practices for Working with Management Packs

  • The product group that creates the product also makes the management packs, so you will have the combined knowledge of the people who created the product to assist you with monitoring your applications in the most recommended way
  • Brian Wren has done an outstanding job writing the System Center Management Pack Authoring Guide, which you can find at http://social.technet.microsoft.com/wiki/contents/articles/15251.system-center-management-pack-authoring-guide.aspx
  • There is also an MSDN Channel9 series about management packs available at http://channel9.msdn.com/Series/System-Center-2012-R2-Operations-Manager-Management-Packs and a Microsoft Virtual Academy series available at http://www.microsoftvirtualacademy.com/training-courses/system-center-2012-r2-operations-manager-management-pack.
  • For more information, see the article “MP Best Practice: Using the Seed Pattern for Easy Discovery” on the TechNet Wiki at http://social.technet.microsoft.com/wiki/contents/articles/1208.mp-best-practice-using-the-seed-pattern-for-easy-discovery.aspx.
  • For more information, see the article “How to detect and troubleshoot frequent configuration changes in Operations Manager” on Microsoft Support at http://support.microsoft.com/kb/2603913. See also Kevin Holman’s article “What is config churn?” on his blog at http://blogs.technet.com/b/kevinholman/archive/2009/10/05/what-is-config-churn.aspx.
  • For more information, see Kevin Holman’s article “Tuning tip: Do you have monitors constantly flip flopping?” on his blog at http://blogs.technet.com/b/kevinholman/archive/2009/12/21/tuning-tip-do-you-have-monitors-constantly-flip-flopping.aspx.
  • When you seal a management pack, the file is digitally signed by the provider and the user knows that it hasn’t been modified since then. To upgrade a sealed management pack, the same key must be used or the upgrade will fail.
  • Summary of best practices. Here is a list of the most important things to consider when working with management packs:
    • Class properties you choose should change values as seldom as possible, close to never.
    • Don’t use Operations Manager for software inventory (System Center Configuration Manager is built to do that), and don’t collect too many properties.
    • Monitors should change their state as seldom as possible. They should not be too sensitive, and the related issue that is described in the alert should be resolved in a more permanent manner.
    • The type space should be kept as small as possible. Import or create only what you need and delete what you do not use.
    • Windows PowerShell scripts that connect to the Data Access Service should be kept to a minimum. At least try to develop them in a way that loads as few objects as possible by using selection criteria for the Operations Manager cmdlets.
    • Don’t over-use maintenance mode. If there is no way around it, reduce database grooming settings for state change events data.
    • Targets for workflows should be as specific as possible. Use seed classes with lightweight discovery rules for custom application monitoring.
    • Tune existing workflows using overrides. Disable unneeded workflows, adjust thresholds, set higher run intervals, and so on.
    • Prefer static groups instead of dynamic groups, or at least try to use lightweight criteria for your dynamic groups.
    • Change the group calculation interval when there are many groups in the Operations Manager environment.
    • Configure before you customize. Determine if you can use an existing workflow for what you need instead of creating a new one.
    • Classes, groups, modules, and so on should be in a sealed management pack so that they are not unexpectedly modified and so that they can be referenced by content in other management packs.
  • Management Pack Viewer (MP Viewer) was first developed by Boris Yanushpolsky and later updated for Operations Manager 2012 and management pack bundle (MPB) files by Daniele Muscetta. The download link for this tool is http://blogs.msdn.com/b/dmuscett/archive/2012/02/19/boris-s-tools-updated.aspx.
  • Not all available management packs are divided into Discovery, Monitoring, and Presentation parts. If everything is in one management pack file, the following explanation is still valid. However, since dividing a management pack into these three different parts is best practice for building your own management packs, you should follow the example set by the SQL Server management pack.
  • When creating your own management packs, you shouldn’t use a broad class for all of your discoveries because it will negatively impact the performance of Operations Manager. Use a broad class only for the base discovery or the seed discovery.
  • For best practices on how to configure overrides, please see the following Microsoft Knowledge Base article: http://support.microsoft.com/kb/943239.
  • More information on relationships can be found at http://social.technet.microsoft.com/wiki/contents/articles/14256.operations-manager-management-pack-authoring-classes-and-relationships.aspx#Relationships.
  • Thresholds don’t appear in the overview window of MP Viewer or in the Excel or HTML file when you export the management pack. To view them, select a rule or a monitor, and then click the Knowledge, Alert Description, or Raw XML tab (for monitors) or click the Knowledge or Raw XML tab (for rules) in the bottom right pane. When you select Raw XML, you will see the actual XML code that makes up the management pack. In this raw XML code, you can also see the thresholds
  • The MP Wiki found at http://social.technet.microsoft.com/wiki/contents/articles/16174.microsoft-management-packs.aspx contains all the Microsoft-provided management packs and their release dates. Check this page often to determine whether you have the latest version of the management packs installed.
  • You cannot change the target for an override using the Operations Manager console. Instead, you must note the changes you make in the specific override, delete it, and then re-create it with the new target
  • If groups are created with extended authoring tools (or directly in XML using your preferred XML editor), they can and should be based on Windows Computer objects hosting special applications, for instance, a Windows Computer group that contains only Windows computers based on a discovered custom special application class. For notifications, the corresponding Health Service Watcher objects could be added to the group. This is necessary because you need the Health Service Watcher objects for Operations Manager self-monitoring alerts like Heartbeat Failures or Computer Not Reachable to be included too. Also remember to add cluster objects (if you need cluster-based alerts), which are not hosted by Windows Computer.
  • More information about building and understanding groups is covered in Kevin Holman’s blog at http://blogs.technet.com/b/kevinholman/archive/2010/07/27/authoring-groups-from-simple-to-complex.aspx.

 

Chapter 03: Getting the Most out of Operations Manager Visualizations

 

Chapter 04: Troubleshooting your Operations Manager Environment

  • The most basic thing to check is that the information event 6022 is being logged periodically, which indicates that the HealthService is running at least some workflows (through MonitoringHost processes) and is not in a hung state or something similar.
  • It is more than enough to go through the events from the past 6 to 10 hours because if there is a failure at some point, that failure will repeat itself often.
  • Usually, you should first filter the event log just on Error and Warning events (Operations Manager never triggers a Critical level event).
  • It is good to go through each Error or Warning event and make an analysis along these lines:
    • What is the frequency of the event?
    • What is the exact event description?
    • For events with the same event ID, are these really the exact same event based on a careful comparison of the event description?
    • If you see a problem event for some workflow that you know should run every 10 minutes, is the last such event fresh or is it too old, maybe indicating this was a one-time problem?
    • Are there one or more events that seem to be spamming the event log? For example, do you see the same event 50 times in 1 second, or something similar?
  • There can be two (or more) events that have the same event ID and exact same event description, but with a very specific and important difference: a different error code in the description
  • Event Tracing for Windows (ETW) is a tracing technology that Windows uses. ETW tracing is also used in most Microsoft software and in Operations Manager as well. A tutorial about the ETW framework can be found at https://technet.microsoft.com/en-us/library/jj714799.aspx. To start the trace with all available providers (trace everything), you can follow the Knowledge Base article at http://support.microsoft.com/kb/942864.
  • SQL Server Profiler Tracing is a feature of Microsoft SQL Server that allows you to trace different actions in SQL Server (see https://msdn.microsoft.com/en-us/library/ms181091(v=sql.110).aspx)
  • One of the most important aspects of maintaining a healthy and performant Operations Manager environment is management pack tuning. Each time you import a new management pack, you need to monitor the data it collects and how it behaves in the following one to two weeks.
  • Another reason for a big StateChangeEvent table is state change events that are older than the data grooming setting for state changes (default 7 days). This can happen if you manually (or via some automated/scripted method) close alerts without resetting the monitors that raised them. It is against best practice to do this because the grooming stored procedure that cleans up state changes does not delete state changes that belong to a monitor that is not in the Healthy state. Additionally, a high number of state changes might cause the stored procedure to time out and not be able to delete everything.
  • Many different issues on either management servers or agents are caused by known problems with certain versions of the Windows operating system. Because of this, a Knowledge Base article listing recommended Windows operating system (version dependent) hotfixes and update rollups is available at http://support.microsoft.com/kb/2843219.
  • A good Knowledge Base article to help troubleshoot the different scenarios for agents that are displayed as gray is available at http://support.microsoft.com/kb/2288515. One of these scenarios also describes the presence of warning event 2115, which would most likely appear on management servers or gateway servers and may involve performance problems. Another great Knowledge Base article for troubleshooting this issue in detail is available at http://support.microsoft.com/kb/2681388.
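
The event log checks from this chapter can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the triage function and the (timestamp, level, event ID, description) tuple layout are hypothetical, and you would first have to export the Operations Manager event log into that shape yourself. The 10-hour window, event 6022, the Error/Warning filter, and the "50 times in 1 second" burst check come from the highlights above.

```python
from collections import Counter
from datetime import datetime, timedelta


def triage(events, now, window_hours=10, burst_threshold=50):
    """Summarise events given as (timestamp, level, event_id, description) tuples."""
    cutoff = now - timedelta(hours=window_hours)
    recent = [e for e in events if e[0] >= cutoff]

    # Event 6022 is informational; seeing it suggests the HealthService runs workflows.
    health_service_active = any(e[2] == 6022 for e in recent)

    # Only Error and Warning events are analysed further.
    problems = [e for e in recent if e[1] in ("Error", "Warning")]

    # Key on ID *and* description so that events with the same ID but a
    # different error code in the description are counted separately.
    frequency = Counter((e[2], e[3]) for e in problems)

    last_seen = {}
    per_second = Counter()
    for ts, _level, event_id, description in problems:
        key = (event_id, description)
        if key not in last_seen or ts > last_seen[key]:
            last_seen[key] = ts
        per_second[(key, ts.replace(microsecond=0))] += 1

    # An event logged burst_threshold times or more within one second is "spamming".
    bursts = sorted({key for (key, _second), n in per_second.items()
                     if n >= burst_threshold})

    return {
        "health_service_active": health_service_active,
        "frequency": frequency,
        "last_seen": last_seen,
        "bursts": bursts,
    }
```

The last_seen timestamps help answer the "is the last such event fresh or too old?" question for workflows with a known interval, while frequency and bursts point at the events worth reading in full first.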

 

Chapter 05: Using Operations Manager in Cloud Environments
