
Comments (27)

ritchiecarroll commented on May 23, 2024

What version of the openHistorian are you using?

How is the data flowing into the openHistorian?

Is this PMU data?

Thanks,
Ritchie

from openhistorian.

yuwenpeng commented on May 23, 2024

@ritchiecarroll ,
Thank you Ritchie!
We are running openHistorian 2.0.415 on Windows Server 2012 with .NET Framework 4.5 and the openHistorian 2.1 release on Windows 10 with .NET Framework 4.6. Both installations are missing data.
It is FDR data using the F-NET protocol. We have a self-developed server that receives data from the FDRs directly, then forwards it to openHistorian. The 'ConnectionString' of the devices in the openHistorian configuration database is 'transportprotocol=Tcp;......'. According to the action adapter, all data is received by openHistorian.

What version of the openHistorian are you using?
How is the data flowing into the openHistorian?
Is this PMU data?

ritchiecarroll commented on May 23, 2024

So you have developed a custom action adapter that runs inside the openHistorian that receives data from the FDRs? Or are you using the built-in FNET device protocol?

The only thing I can think of is that if the timestamps of the incoming data were duplicated - these points would be harder to extract since the historian "key" is based on ID, Timestamp then a counter.
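Conceptually, the key ordering described here can be sketched as a tuple of (ID, timestamp, counter). The following is a minimal illustration only; the field names, tick-based timestamp, and counter-assignment logic are assumptions for the sketch, not the actual GSF.Snap key implementation:

```python
from typing import NamedTuple

class HistorianKey(NamedTuple):
    # Illustrative only: the real openHistorian key orders by point ID,
    # then timestamp, then a counter that disambiguates collisions.
    point_id: int
    timestamp: int       # e.g. ticks
    entry_number: int    # bumped when (point_id, timestamp) repeats

def make_key(existing, point_id, timestamp):
    """Assign a key, bumping the counter for a duplicated timestamp."""
    n = sum(1 for k in existing
            if k.point_id == point_id and k.timestamp == timestamp)
    return HistorianKey(point_id, timestamp, n)

keys = []
keys.append(make_key(keys, 101, 1000))  # entry_number == 0
keys.append(make_key(keys, 101, 1000))  # duplicate -> entry_number == 1
```

A lookup by (ID, timestamp) alone would only see the first entry unless the counter range is also scanned, which is one way duplicated timestamps can make points harder to extract.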

yuwenpeng commented on May 23, 2024

@ritchiecarroll
A custom action adapter runs inside the openHistorian. The adapter receives data from openHistorian by overriding ConcentratorBase.PublishFrame(IFrame frame, int index).

So you have developed a custom action adapter that runs inside the openHistorian that receives data from the FDRs? Or are you using the built-in FNET device protocol?

yuwenpeng commented on May 23, 2024

@ritchiecarroll
In the custom action adapter, I write the timestamp and value count of every IFrame to a text file. I didn't find any duplicated timestamps.

The only thing I can think of is that if the timestamps of the incoming data were duplicated - these points would be harder to extract since the historian "key" is based on ID, Timestamp then a counter.

ritchiecarroll commented on May 23, 2024

I am assuming the actual F-NET data comes in using a standard device connection using the built-in UTK FNET protocol?

If so, the steps below will help us see if the outage is due to a re-connect - it could be that there is a delay in the source data stream and the device simply reconnects, a standard operation when data stops flowing. By default this timeout is set to 5 seconds, which may be too short, so the device keeps reconnecting. By the way, this setting is called "Data Loss Interval" and can be found on the openHistorian Manager device configuration screen.

From the openHistorian machine connected to the remote F-NET streams, run the “Statistics Trending Tool” – that is its name in the start menu; the actual EXE name is StatisticView.exe in the openHistorian installation folder. Once running, connect to the Statistics historian – the application should already default to this archive when no other local archive is installed, but just in case, the default statistics archive path is “C:\Program Files\openHistorian\Statistics\” – this assumes openHistorian was installed on the C: drive.

Now, for the connected FNET device, find the statistic that ends with “!IS:ST8”, i.e., input stream statistic number 8, which is a Boolean value representing whether the input stream was continually connected during the last reporting interval. The default statistics reporting interval is 10 seconds, so if the stat is non-zero the device was continually connected over that 10-second period.

If you select this statistic and trend this value over the time window for the data gap you found, you will then know if the device connection was interrupted and hence caused the data gap.

If the device connection was interrupted, you can go through the logs (using the “Log File Viewer” application) and check the messages for the device around the same timeframe – from these you can get a clue whether the connection was terminated by the remote device or reset because data stopped flowing, often due to a change in the network, e.g., a router reboot.

If this is not the case then there may be something else going on and I would suggest carefully monitoring the logs for errors or connection issues.

Thanks,
Ritchie

yuwenpeng commented on May 23, 2024

@ritchiecarroll
Thank you for your kind help!
All the devices use built-in UTK FNet protocol.
One of the time windows of missing data is 02/08/2017 0:02:39 - 0:02:42. I picked four devices which have missing data, as shown below. It seems all of them were continually connected in the time window.
[screenshot]
Actually, almost all devices (around 100) lost data in that time window, and we didn't find any connection status change. The status was updated at 0:02:26, 36, 46, 56... And the action adapter received all the data.
[screenshots]

ritchiecarroll commented on May 23, 2024

We may want you to install a newer build so that we can turn on enhanced logging and evaluate what happens when a gap occurs. Would this be possible? If so, you can update the existing version with a nightly build:

https://www.gridprotectionalliance.org/NightlyBuilds/openHistorian/Beta/openHistorian.Installs.zip

The new version has a new logging system that provides much more detail. Also, the new archive log files are time-stamped so that we can find overlapping logs around the time of a detected data gap. After installation the log files will end up in the "C:\Program Files\openHistorian\Logs" folder.

Thanks,
Ritchie

ritchiecarroll commented on May 23, 2024

Also - what about data quality? Are we sure received data is not NaN? Are you writing captured data from your custom adapter to a file that we could review?

Thanks,
Ritchie

yuwenpeng commented on May 23, 2024

@ritchiecarroll
That's great! Let me install this newer build to get detailed log information.

Also - what about data quality? Are we sure received data is not NaN? Are you writing captured data from your custom adapter to a file that we could review?

Right now the adapter doesn't write data to files; only the timestamp and value count are captured. Since we have another self-developed server receiving the same forwarded data without any problem, the data quality should be good. To make 100% sure, I'll write all the data to files after installing the aforementioned build.
Thank you very much!
Wenpeng

yuwenpeng commented on May 23, 2024

@ritchiecarroll
With the new build, I got another time window of missing data: [UTC 2/8/2017 23:18:20.5, 23:18:22.7]
[screenshot]

I didn't find any connection status change.
[screenshot]

The custom action adapter received all the data, and the data quality is good.
[screenshot]
The value counts (the last column) received by the adapter didn't change significantly, staying around 80 to 88 over several hours.
[screenshot]

It seems all data is received by openHistorian and the data quality is good, but something stopped openHistorian from archiving some of the time frames.

ritchiecarroll commented on May 23, 2024

OK, thanks. Let's get a copy of the log files around that time frame and see what interesting things were going on.

yuwenpeng commented on May 23, 2024

@ritchiecarroll
Here is the local time (column 0) and UTC time (column 1):
[screenshot]
This is a screenshot of the Log File Viewer. Usually there are at least 5 log messages every 10 seconds, but there aren't any logs from 18:18:18 - 18:18:28. The log file is also attached.

[screenshot]

This is the log file. Please change the extension back to .logz - I'm not allowed to upload logz or zip files...
20170208-225631-007026 - openHistorian - 1.pdf

Are there any other log files that may have useful information?

StevenChisholm commented on May 23, 2024

ritchiecarroll commented on May 23, 2024

Looks like there is a pause, maybe Garbage Collection related, and we will want to validate the GC settings in your config file (i.e., openHistorian.exe.config).

Regardless, the source data stream and device simply reconnect when no data is received during the pause - again, a standard operation when data stops flowing. By default this timeout is set to 5 seconds, which must be too short, and the device is reconnecting. This setting is called "Data Loss Interval" and can be found on the openHistorian Manager device configuration screen.

Suggest changing this to 15 seconds for each input device to see if data loss stops.

Thanks,
Ritchie

yuwenpeng commented on May 23, 2024

Added EnableThreadPoolMonitoring=; and set 'Data loss interval' to 15 by modifying the 'Device' table in the configuration database directly, then restarted openHistorian. Hope it helps.

To verify this is the root cause, turn on ThreadPoolMonitoring and restart openHistorian. You can do this by modifying your openhistorian.exe.config file. Find the line that says OptimizationsConnectionString and add EnableThreadPoolMonitoring=;

Suggest changing this to 15 seconds for each input device to see if data loss stops.

I still have some questions. According to the custom action adapter and the statistics database, all the data is received by openHistorian and all devices are continually connected. Also, if there were a pause, we should observe an obvious change in the time difference between the timestamp (UTC) and the log time, which is the local time when the custom adapter received the time frame. But according to the log files the time difference stayed at 2.3 seconds and didn't change.

ritchiecarroll commented on May 23, 2024

I agree - that's what is strange about all this. Also, there are some very unusual messages in your error log. Perhaps the next best step is to have a WebEx and take a look?

Want to send me an e-mail about scheduling this?

Thanks,
Ritchie

yuwenpeng commented on May 23, 2024

@ritchiecarroll
I modified the configuration as you and Steven Chisholm suggested, and openHistorian has run for several hours. There is still missing data.
Thank you for taking the time to help us! I'm flexible for a WebEx. When are you available?

ritchiecarroll commented on May 23, 2024

Tomorrow afternoon? Please send me an e-mail at my GPA e-mail address and we will arrange this...

yuwenpeng commented on May 23, 2024

@ritchiecarroll @StevenChisholm
I'm digging into the source code and have traced it down to GSF.Snap.Collection.SortedPointBuffer<TKey, TValue>.TryEnqueue(TKey, TValue) for the PrebufferWriter.
It seems the SortedPointBuffer received all the data. I need to go deeper to find where the data gets lost...

yuwenpeng commented on May 23, 2024

@ritchiecarroll @StevenChisholm
It seems data may sometimes get lost when creating a UnionTreeStream<TKey, TValue>.
UnionTreeStream<TKey, TValue> is used to merge a PendingTable to the next level, or to roll over pending tables from all levels to the NextStage archive. But sometimes the merging procedure stops unexpectedly and the remaining data gets lost. One example is shown below:
10 PendingTable2 tables were merged into PendingTable3. There were 53021 values in the 10 PendingTable2 tables, but only 15674 of them were copied to PendingTable3. This can also happen in the rollover from the FirstStage to the NextStage archive.
I still don't know what stops the merging procedure.
[screenshot]
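As a rough mental model (a toy sketch with assumed names, not the actual GSF implementation), a union of several sorted streams can be built by caching each stream's current key, keeping the streams sorted by that cached key, and re-sorting whenever a stream advances - including when a duplicated key is skipped:

```python
class UnionStream:
    """Toy union of several ascending streams that drops duplicate keys.

    Each stream caches its current key, the streams are kept sorted by
    that cached key, and whenever a stream advances its entry is bubbled
    back into sorted position.
    """

    def __init__(self, streams):
        self.streams = []  # entries of [cached_key, iterator]
        for s in streams:
            it = iter(s)
            try:
                self.streams.append([next(it), it])
            except StopIteration:
                pass  # empty stream
        self.streams.sort(key=lambda e: e[0])

    def _advance(self, index):
        # The stream at `index` consumed its cached key: fetch the next key
        # and restore sort order by bubbling the entry toward the end.
        entry = self.streams[index]
        try:
            entry[0] = next(entry[1])
        except StopIteration:
            del self.streams[index]
            return
        x = index  # start at the entry that changed, not at 0
        while x + 1 < len(self.streams) and self.streams[x][0] > self.streams[x + 1][0]:
            self.streams[x], self.streams[x + 1] = self.streams[x + 1], self.streams[x]
            x += 1

    def read_all(self):
        out = []
        while self.streams:
            key = self.streams[0][0]
            out.append(key)
            self._advance(0)
            # Advance any other stream caching the same (duplicated) key;
            # this is the step where a faulty re-sort could skip streams.
            i = 0
            while i < len(self.streams):
                if self.streams[i][0] == key:
                    self._advance(i)
                else:
                    i += 1
        return out
```

If the re-sort after an advance leaves the stream list out of order, the front of the list is no longer the true minimum, so whole streams can effectively be ignored and their remaining data dropped - consistent with the truncated merges observed above.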

yuwenpeng commented on May 23, 2024

@ritchiecarroll @StevenChisholm
Duplicated data may be the reason, but I'm not sure yet...

yuwenpeng commented on May 23, 2024

@StevenChisholm
@ritchiecarroll
I think I found the bug. It is in UnionTreeStreamSortHelper.SortAssumingIncreased(int index), at line 116 as shown below.
[screenshot]
x should not start from 0 but from index. Once there are duplicated keys in the BufferedTreeStreams, and the duplicated key happens to be the last key of the second (or third, and so on) BufferedTreeStream, this sort function is called and makes a mistake.
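A simplified illustration of why the starting index matters (this is a guess at the failure mode in a stripped-down form, not the actual GSF code): if the bubble loop that re-positions the advanced entry starts at 0 and stops at the first already-ordered pair, it never reaches the entry that actually moved, leaving the list unsorted.

```python
def bubble_right(keys, index):
    # Correct: restore sort order starting from the entry that increased.
    x = index
    while x + 1 < len(keys) and keys[x] > keys[x + 1]:
        keys[x], keys[x + 1] = keys[x + 1], keys[x]
        x += 1

def bubble_right_buggy(keys, index):
    # Starting from 0: the loop exits at the first ordered pair
    # and never repositions the entry at `index`.
    x = 0
    while x + 1 < len(keys) and keys[x] > keys[x + 1]:
        keys[x], keys[x + 1] = keys[x + 1], keys[x]
        x += 1

a = [1, 6, 3, 5]              # the entry at index 1 just advanced from 2 to 6
fixed, buggy = a[:], a[:]
bubble_right(fixed, 1)        # -> [1, 3, 5, 6], sorted again
bubble_right_buggy(buggy, 1)  # -> [1, 6, 3, 5], left unsorted
```

With the list left unsorted, the union no longer reads the true minimum key from the front, so trailing streams can be skipped entirely.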

One of the captured examples is shown below, in which 2017-02-12 00:37:22.3000000/785 is the duplicated key.
There are 3 sections:

  1. The original order of the BufferedTreeStreams, with their cached keys and valid flags, before removing the duplicated key.
  2. Information on the streams after removing the duplicated key by advancing the position of the second duplicated key.
  3. Information on and order of the streams after sorting them.
     This caused the 3 remaining streams to be ignored.
     [screenshot]

Then the data gets lost.
[screenshot]

StevenChisholm commented on May 23, 2024

yuwenpeng commented on May 23, 2024

The problem is solved.
Sometimes a meter may resend data due to network issues; this is why we got duplicated keys.
Previously, data loss happened around 7 times per hour on average. Now there hasn't been any missing data in the last 15 hours.
Thank you for all your help! @ritchiecarroll @StevenChisholm

Best wishes,
Wenpeng Yu

StevenChisholm commented on May 23, 2024

ritchiecarroll commented on May 23, 2024

I'll add my thank you as well - nice job @yuwenpeng! I'll close this for now given the discovered fix:
437ed9d

Thanks again!
Ritchie
