I will also start putting down some ideas on paper about architecture.
I have been preparing a document that outlines a distributed
architecture with the features talked about below.  My hope was to
perhaps get some people working together and build it from scratch,
similar to the way projects like KDE, GNOME and others are showing that
these distributed efforts are very possible.

Key areas that I will expand upon are (No particular order) -

1) Security on a user-id level.  So some users can add objects, others
can browse them, others can edit maps, etc.  For any large site this is
critical.
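
To make this a bit more concrete, here is a rough sketch in Python.  The
permission names and the user table are made up for illustration; it is
one possible shape, not a settled design.

```python
from enum import Flag, auto


class Permission(Flag):
    """Hypothetical per-user rights; the names are illustrative only."""
    BROWSE = auto()
    ADD_OBJECTS = auto()
    EDIT_MAPS = auto()


# Hypothetical user table: user id -> permissions granted to that user.
USER_RIGHTS = {
    "operator": Permission.BROWSE,
    "engineer": Permission.BROWSE | Permission.ADD_OBJECTS,
    "admin": Permission.BROWSE | Permission.ADD_OBJECTS | Permission.EDIT_MAPS,
}


def is_allowed(user_id: str, needed: Permission) -> bool:
    """Return True only if the user holds every permission in `needed`."""
    granted = USER_RIGHTS.get(user_id, Permission(0))  # unknown users get nothing
    return (granted & needed) == needed
```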

2) Additional object fields such as -
  a) Location: where the object resides - different from the MIB II
object as these can be lost during power outages, etc.
  b) type: What is this equipment?  Ex. Unix Server, NT server, VAX
  c) Support Group: Who supports or is mainly responsible for this
device?  Ex. Network Group, Mail group, LAN group, etc.
  d) End Users:  List of end users that use the device. (Can be many).
This can enhance apps like events, which can then look at just the
devices that impact a certain user type.
  Apps like events can use these fields to show just the events
affecting a given location, or UNIX folks can see just the events they
care about, etc.
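
As a rough sketch of how the extra fields could drive event filtering
(Python; the class and function names are my own invention, not an
existing API):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ManagedObject:
    name: str
    location: str        # where the box physically resides
    device_type: str     # e.g. "Unix Server", "NT server", "VAX"
    support_group: str   # e.g. "Network Group", "Mail group"
    end_users: List[str] = field(default_factory=list)


@dataclass
class Event:
    source: ManagedObject
    message: str


def events_for(events, *, location=None, support_group=None, end_user=None):
    """Return the events whose source object matches every filter given."""
    selected = []
    for ev in events:
        obj = ev.source
        if location is not None and obj.location != location:
            continue
        if support_group is not None and obj.support_group != support_group:
            continue
        if end_user is not None and end_user not in obj.end_users:
            continue
        selected.append(ev)
    return selected
```

So the NY operators would call events_for(events, location="NY") and the
UNIX folks would filter on their support group instead.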
 
3) Separate the polling engine from the display portion. Have the
display portion (graphics and the events apps) run and connect through
sockets with a poller.  The poller checks the status of all systems, and
then when an exception occurs, notifies the display unit.  So
communication between the two systems would be minimal.  This will -
  a) Allow a large number of users to look at the same data.
  b) Make things very fast for remote users and/or dial up users.
  c) Allow for very easy extension of the poller.  So perhaps down the
line you can have distributed pollers that all feed one main poller, to
which applications connect.
  d) Make the system more modular and easier to extend (Others disagree
with me here, but as long as multiple people work on this I still say
this is true).
  e) Allow for future display systems to be written in Java, or as CGI,
or on a PC.
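
A minimal sketch of the poller/display split, in Python.  The message
format (one JSON object per line) and the class name are assumptions of
mine, and a socketpair stands in for a real network connection.  The
point is only that the poller sends something when a status changes, not
on every poll.

```python
import json
import socket


class Poller:
    """Keeps the last known status and pushes only changes to displays."""

    def __init__(self):
        self.status = {}     # host -> "up" / "down"
        self.displays = []   # sockets of connected display programs

    def attach(self, sock):
        self.displays.append(sock)

    def record(self, host, state):
        """Store a poll result; notify displays only if the state changed."""
        if self.status.get(host) == state:
            return False
        self.status[host] = state
        line = json.dumps({"host": host, "state": state}) + "\n"
        for sock in self.displays:
            sock.sendall(line.encode("utf-8"))
        return True


# A socketpair stands in for the connection between poller and display.
poller_end, display_end = socket.socketpair()
poller = Poller()
poller.attach(poller_end)

poller.record("router-ny", "up")
poller.record("router-ny", "up")      # unchanged, so nothing is sent
poller.record("router-ny", "down")

poller_end.close()                    # lets the display side see end of stream
with display_end.makefile("r", encoding="utf-8") as stream:
    received = [json.loads(line) for line in stream]
display_end.close()
```

Three polls, but the display only hears about two things: the box came
up, the box went down.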

4) Have the polling system be SNMP independent.  So, a system can be
checked for availability with ICMP or SNMP or (IN THE FUTURE) Novell's
IPX, etc.  The availability check can also be a script that checks some
legacy app (for example I have scripts to check non-SNMP X.25 devices).
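
One way to picture this, sketched in Python: every check method is just
a function that takes a target and answers up or down, and the poller
looks the method up by name.  The registry, the method names and the
convention that exit status 0 means "up" are all assumptions of mine.
Real ICMP and SNMP checks are left out here because they need raw
sockets or an SNMP library.

```python
import socket
import subprocess
import sys

# Registry of availability checks: method name -> function(target) -> bool.
CHECKS = {}


def tcp_check(port, timeout=3.0):
    """Build a check that succeeds if a TCP connection can be opened."""
    def check(target):
        try:
            with socket.create_connection((target, port), timeout=timeout):
                return True
        except OSError:
            return False
    return check


def script_check(command):
    """Build a check that runs an external command with the target appended.

    Assumed convention: exit status 0 means the target is available.
    """
    def check(target):
        result = subprocess.run(command + [target], capture_output=True, timeout=30)
        return result.returncode == 0
    return check


def is_available(method, target):
    """Dispatch to whichever check was configured for this system."""
    return CHECKS[method](target)


CHECKS["telnet-port"] = tcp_check(23)
# Stand-in for a legacy script: it reports every target as up except "x25-dead".
CHECKS["legacy-script"] = script_check(
    [sys.executable, "-c", "import sys; sys.exit(sys.argv[1] == 'x25-dead')"]
)
```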

5) Ensure the concept of services running on a system is there.  For
example FTP and Telnet.  So services can have their availability
checked, and yet when the host is not reachable they are not checked,
as they are expected to be down.
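
In code the idea is small.  A sketch in Python, with the function name
and the status strings invented for illustration:

```python
def poll_host(host_up, service_checks):
    """Return {service: status} for one host.

    host_up:        function() -> bool, the host availability check
    service_checks: {service name: function() -> bool}

    When the host itself is unreachable the service checks are not run
    at all; the services are simply reported as unknown.
    """
    if not host_up():
        return {name: "unknown (host down)" for name in service_checks}
    return {
        name: ("up" if check() else "down")
        for name, check in service_checks.items()
    }
```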

6) Have the DB understand system dependencies.  If router X is down,
then don't expect to see devices X, Y, and Z up (if you do, send a
misconfiguration alarm).  The dependencies can be entered by hand rather
than discovered automatically.
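
A sketch of the dependency logic in Python, assuming a hand-entered
table that says which device each device sits behind.  The function name
and the status strings are made up.

```python
def classify(device, is_up, parent_of):
    """Classify one device from poll results and hand-entered dependencies.

    is_up:     {device: True/False} from the last poll
    parent_of: {device: the device it sits behind}; absent if it has none
    """
    seen = {device}                    # guards against loops in the table
    ancestor = parent_of.get(device)
    upstream_down = False
    while ancestor is not None and ancestor not in seen:
        if not is_up[ancestor]:
            upstream_down = True
            break
        seen.add(ancestor)
        ancestor = parent_of.get(ancestor)

    if upstream_down:
        # Something in front of this device is down.  If we can still see
        # the device, the dependency table must be wrong.
        return "MISCONFIGURATION" if is_up[device] else "UNREACHABLE (SUPPRESSED)"
    return "OK" if is_up[device] else "DEVICE DOWN"
```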

7) Each alarm should have a category and detail subfield.  For example
the category can be AVAILABILITY and the subfield could be DEVICE DOWN
or DEVICE COULDN'T BE VERIFIED.  Or the category can be UTILIZATION and
the subfield can be TOO MANY USERS ON, or CPU TOO HIGH or NETWORK
UTILIZATION TOO HIGH.  These help with presentation and reporting.
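
For instance, with the two fields in place a report by category becomes
a one-liner.  A small Python sketch (the record layout and the sample
host names are invented):

```python
from collections import Counter
from typing import NamedTuple


class Alarm(NamedTuple):
    source: str
    category: str   # e.g. AVAILABILITY, UTILIZATION
    detail: str     # e.g. DEVICE DOWN, CPU TOO HIGH


alarms = [
    Alarm("router-ny", "AVAILABILITY", "DEVICE DOWN"),
    Alarm("mailhost", "UTILIZATION", "CPU TOO HIGH"),
    Alarm("router-la", "AVAILABILITY", "DEVICE COULDN'T BE VERIFIED"),
]


def count_by_category(items):
    """Summarize alarms per category for a report."""
    return Counter(alarm.category for alarm in items)
```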

8) Have the system understand the concept of an object called LINK or
LINE.  This object is made up of an interface at one end and an
interface at the other end.  Any event affecting either end affects
this object.  The key is that it makes it easier to do reporting on it
and for people to understand.  Folks don't say interface 4 on router X
is seeing errors.  They say the line from NY is having errors at the NY
end.
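
A sketch of the LINK object in Python.  The class names, fields and the
wording of the report line are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interface:
    device: str
    name: str
    site: str


@dataclass
class Link:
    name: str
    end_a: Interface
    end_b: Interface

    def describe(self, iface: Interface, problem: str) -> Optional[str]:
        """Turn an interface-level event into a statement about the line.

        Returns None when the interface is not one of this link's ends.
        """
        if iface not in (self.end_a, self.end_b):
            return None
        return f"{self.name} is having {problem} at the {iface.site} end"
```

An event on interface 4 of the NY router then reports against the line
instead of against the interface.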

9) Have a higher level app than the events, say call it alarms.  This
app will be more finicky about what it shows.  Alarms should show
elements that are active or problems that happened and cleared
themselves.  (This will take some explaining - but here I will try one
quickly).  So a router that is down shows on the alarms view because
that is a current error condition.  Traffic that is too high right now
also shows.  Once the router comes back up, it doesn't show on the main
app pane as that condition doesn't exist anymore.  Instead it gets
logged to history with the start and end time of the condition. (I hate
that on all management apps you get two entries, one for the start and
one for the end of things, which makes it impossible to really figure
out anything.)
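
Roughly, in Python (the class and method names are invented): the active
view holds only current conditions, and clearing a condition moves it to
history as a single entry with both times.

```python
class AlarmBoard:
    """Active conditions in one place, finished ones in history."""

    def __init__(self):
        self.active = {}    # (source, condition) -> start time
        self.history = []   # (source, condition, start time, end time)

    def raise_alarm(self, source, condition, when):
        # A repeat of an already active condition keeps the original start.
        self.active.setdefault((source, condition), when)

    def clear_alarm(self, source, condition, when):
        start = self.active.pop((source, condition), None)
        if start is not None:
            self.history.append((source, condition, start, when))
```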

10) Have a process that is able to perform analysis on the events that
come in.  So for example a rule can be if you see 5 authentication
failures on any host (or host from Location XXX) within 30 minutes,
create a new event which is critical that says "too many authentication
failures on net (or Location XXX)."
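
A sketch of such a rule in Python, using a sliding time window.  The
class name and the shape of the generated event are assumptions, and the
sketch assumes events arrive in time order.

```python
from collections import deque


class ThresholdRule:
    """Emit a critical event after `threshold` matches within the window."""

    def __init__(self, event_type, threshold, window_seconds, message):
        self.event_type = event_type
        self.threshold = threshold
        self.window = window_seconds
        self.message = message
        self._times = deque()   # timestamps of recent matching events

    def feed(self, event_type, timestamp):
        """Process one incoming event; return a new event dict or None."""
        if event_type != self.event_type:
            return None
        self._times.append(timestamp)
        # Forget matches that have fallen out of the window.
        while timestamp - self._times[0] > self.window:
            self._times.popleft()
        if len(self._times) >= self.threshold:
            self._times.clear()   # start counting afresh after firing
            return {"severity": "critical", "message": self.message}
        return None
```

The example from the text would be ThresholdRule("auth_failure", 5,
30 * 60, "too many authentication failures on net").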

11) Alarms can be assigned to, owned by, or acknowledged by a technician
(again, if we have security, each technician is uniquely identified).

12) Alarms can have additional end user information added to them.  So
as I notice things I can write add-on notes (ex. I reset the box just
now to see if these alarms stop. OR I called Joe Blow to let them know
of this).
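
Items 11 and 12 together could look something like this in Python.  The
field names are invented, and the rule that acknowledging an unowned
alarm also takes ownership is my own assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Alarm:
    summary: str
    owner: Optional[str] = None
    acknowledged: bool = False
    notes: List[Tuple[str, str]] = field(default_factory=list)  # (author, text)

    def assign(self, technician: str) -> None:
        self.owner = technician

    def acknowledge(self, technician: str) -> None:
        """Assumed rule: acknowledging an unowned alarm also takes ownership."""
        if self.owner is None:
            self.owner = technician
        self.acknowledged = True

    def add_note(self, author: str, text: str) -> None:
        self.notes.append((author, text))
```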

13) The poller NEEDS to understand the following -
  a) Wait X polls or Y minutes before notifying something as a problem.
So if a box goes down, check three times before letting me know.  In the
meantime categorize it as a possible problem (warning alarm perhaps -
Yellow color on my map)
  b) Do the same before saying something is up.
  c) Have blackout periods.  For example, some devices are always down
between midnight and 3 AM.  Or basically, I don't even care about
receiving alarms during those time frames.
 I have thought in the past of the concept of a polling entry. The
polling entry says how many SNMP or ICMP retries to send, how often to
poll, etc.  Then for each system you can say, from 8:00 AM to 7:00 PM
apply polling entry X.  From 7:00 PM to 12:00 AM (midnight) apply
polling entry Y.  From 12:00 AM to 7:00 AM apply NO polling entry.
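
Both ideas can be sketched in a few lines of Python.  All names are
invented.  The first part waits for a number of consecutive polls before
confirming a change (in either direction) and reports a warning in the
meantime; the second part picks the polling entry for the time of day,
with None meaning a blackout.

```python
from datetime import time


class DebouncedStatus:
    """Confirm a change only after `confirm_polls` consecutive polls agree."""

    def __init__(self, confirm_polls=3):
        self.confirm_polls = confirm_polls
        self.confirmed = "up"
        self._streak = 0   # consecutive polls that contradict `confirmed`

    def update(self, poll_ok):
        observed = "up" if poll_ok else "down"
        if observed == self.confirmed:
            self._streak = 0
            return self.confirmed
        self._streak += 1
        if self._streak >= self.confirm_polls:
            self.confirmed = observed
            self._streak = 0
            return self.confirmed
        return "warning"   # possible problem, e.g. yellow on the map


def polling_entry_at(schedule, now):
    """Return the polling entry that applies at `now`, or None for no polling.

    schedule: list of (start, end, entry) with datetime.time bounds; an
    end that is not after the start means the window runs past midnight.
    """
    for start, end, entry in schedule:
        if start < end:
            inside = start <= now < end
        else:
            inside = now >= start or now < end
        if inside:
            return entry
    return None


# Day entry, evening entry ending at midnight, then no polling until 7 AM.
SCHEDULE = [
    (time(8, 0), time(19, 0), "X"),
    (time(19, 0), time(0, 0), "Y"),
    (time(0, 0), time(7, 0), None),
]
```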

14) Ability to run scripts when an event is received, or if an event
stays in place for X minutes.  So, if a system remains unavailable for X
minutes, execute the script which will page me (plenty of tools to do
this out there, no need to include one in this package).
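
A sketch of the "stays in place for X minutes" part in Python.  The
class name is invented, and the action is any callable; in practice it
would probably wrap a call to an external paging script.

```python
class Escalation:
    """Run `action` once when a condition has lasted at least `delay` seconds."""

    def __init__(self, delay, action):
        self.delay = delay
        self.action = action
        self._fired = set()   # conditions that have already been escalated

    def check(self, key, started_at, now):
        """Call on every poll cycle while the condition is still active."""
        if key in self._fired or now - started_at < self.delay:
            return False
        self._fired.add(key)
        self.action(key)
        return True

    def cleared(self, key):
        """Forget a condition so that a later recurrence can escalate again."""
        self._fired.discard(key)
```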

15) Have a CGI gateway to the poller to see status information.

16) Alarms and events should have a menu choice called impact.  This
would note which systems are affected if this system is unavailable, or
which services are affected (ex. FTP or WWW).
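
Given a table of what depends on what, the impact list is a walk over
that table.  A Python sketch, with the function name and table layout
assumed for illustration:

```python
def impacted_by(system, depends_on):
    """List everything that directly or indirectly depends on `system`.

    depends_on: {name: [names it depends on]}; the names can be systems
    or services such as "FTP on host-a".
    """
    affected = set()
    frontier = [system]
    while frontier:
        current = frontier.pop()
        for name, needs in depends_on.items():
            if current in needs and name not in affected:
                affected.add(name)
                frontier.append(name)
    affected.discard(system)   # in case the table contains a loop
    return sorted(affected)
```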

17) Ability to define add-on apps for the graphical map.  These apps
would show on the menu depending on the type of system it is (perhaps a
"MIBs supported" field would be of use).

Anyway, these are some off-the-bat thoughts.  As I mentioned I have been
working with network management systems for over 5 years, not so much
programming (even though you can see my pinger program at
http://www.digitaldaze.com/estrella/) but designing, implementing,
testing, selling, etc.  So I have developed in my head my dream product
and the areas in which the big boys right now are just not cutting the
mustard.  I will expand on these and start categorizing them to make
them easier to follow.  Also, I have plenty of WWW space where at a
later point I can put a lot of this information.  Perhaps once an
architecture and plan are laid out people would love to start looking
at helping with some of the apps.

Last, have you looked at SNMP++?  Last I checked a Linux port was either
in beta or pretty much complete.  It looks like it makes it extremely
easy to develop the SNMP code.

Enough from a very excited, Gus Estrella

