[subexp-daq] Report of a possible bug of the CAEN_V560 module
Weber, Guenter Dr.
g.weber at hi-jena.gsi.de
Thu Feb 22 10:04:28 CET 2024
Dear friends,
after the bug in map_map was fixed, the freeze does not happen again. Very good!
Now back to my original concern regarding the V560 module ...
readout_dt looks like this:
uint32_t
caen_v560_readout_dt(struct Crate *a_crate, struct Module *a_module)
{
(void)a_crate;
LOGF(spam)(LOGL, NAME" readout_dt {");
a_module->event_counter.value++;
LOGF(spam)(LOGL, NAME" readout_dt(ctr=0x%08x) }",
a_module->event_counter.value);
return 0;
}
The module counter is incremented by one every time readout_dt is executed. This results in a problem in crate.c:
diff_module = COUNTER_DIFF(*module->crate_counter,
module->event_counter, module->this_minus_crate);
/* TODO: Clean this. */
shadow_counter.value =
module->shadow.data_counter_value;
shadow_counter.mask = module->event_counter.mask;
diff_shadow = COUNTER_DIFF(*module->crate_counter,
shadow_counter, module->this_minus_crate);
create_do_shad = !crate_get_do_shadow(a_crate);
printf("%s: diff_module: %u, module_crate_counter: %u, module_event_counter: %u, module_this_minus_crate: %u \n", keyword_get_string(module->type), diff_module, (*module->crate_counter).value, (module->event_counter).value, module->this_minus_crate);
if (0 == diff_module &&
( create_do_shad ||
NULL == module->props->readout_shadow ||
0 == diff_shadow)) {
ok = 1;
printf("%u \n", ok);
break;
}
getchar();
When the difference between (*module->crate_counter).value and (module->event_counter).value is evaluated the later was already incremented as readout_dt for the module was already executed while the former counter was not incremented.
This is the output:
CAEN_V560: diff_module: 4294967295, module_crate_counter: 0, module_event_counter: 1, module_this_minus_crate: 0
Note that diff_module shows the result of an "0 - 1" operation when working with unsigned integers.
The original version of the crate.c code would now again execute readout_dt of module V560, thus incrementing the module counter another time. Thus, diff_module would be "0 - 2". This would repeat until the timeout condition a bit later in the code is reached.
Then the modules would be re-initialized, thus setting (module->event_counter).value of V560 back to zero. But the crate counter would be incremented. Thus, by shear luck the next try of the same test would have (*module->crate_counter).value and (module->event_counter).value both equal to 1. And from this point the DAQ is running as intended.
Ok, I hope the explanation was clear and I understood correctly what is happening.
Best greetings
Günter
----------------
Günter Weber
Helmholtz-Institut Jena
Fröbelstieg 3
07743 Jena
Germany
Phone: +49-3641-947605
www.hi-jena.de<http://www.hi-jena.de>
GSI Helmholtzzentrum für Schwerionenforschung
Planckstrasse 1
64291 Darmstadt
Germany
www.gsi.de<http://www.gsi.de>
________________________________
Von: subexp-daq <subexp-daq-bounces at lists.chalmers.se> im Auftrag von Weber, Guenter Dr. <g.weber at hi-jena.gsi.de>
Gesendet: Mittwoch, 21. Februar 2024 16:14:45
An: Discuss use of Nurdlib, TRLO II, drasi and UCESB.
Betreff: Re: [subexp-daq] Report of a possible bug of the CAEN_V560 module
Dear Hans,
writing into the register works fine (I tried it several times):
RIO4-MCAL-1 mbsdaq > rwdump -a0x33333350 -w16,0
Address=0x33333350
Raw-write done.
RIO4-MCAL-1 mbsdaq > rwdump -a0x33333350 -w16,0
Address=0x33333350
Raw-write done.
RIO4-MCAL-1 mbsdaq > rwdump -a0x33333350 -w16,0
Address=0x33333350
Raw-write done.
RIO4-MCAL-1 mbsdaq > rwdump -a0x33333350 -w16,0
Address=0x33333350
Raw-write done.
I could now litter map_map with printf() outputs to see where execution of
v560->sicy_map = map_map(v560->address, MAP_SIZE, KW_NOBLT,
0, 0, MAP_POKE_ARGS(fixed_code), MAP_POKE_ARGS(scale_clear));
is failing. Should I proceed this way? Or is there anything else that I could check?
(As I understand, the slightly different implementation of V560 on our running system is not indicative of a specific issue, but just due to fact that this is a deprecated version of NURDLIB. Right?)
Best greetings
Günter
________________________________
Von: Hans Toshihide Törnqvist <hans.tornqvist at chalmers.se>
Gesendet: Mittwoch, 21. Februar 2024 15:28:01
An: Discuss use of Nurdlib, TRLO II, drasi and UCESB.; Weber, Guenter Dr.
Betreff: Re: [subexp-daq] Report of a possible bug of the CAEN_V560 module
Dear Günter,
The most important thing is that you get reasonable values with these
reads, the actual values don't mean a whole lot.
One of the manual reads that you did (ofs=0xfa) is what 'map_map' does
for "poke reading". The macros
MAP_POKE_ARGS(fixed_code), or the older
MAP_POKE_ARGS(*v560->read, fixed_code)
tell 'map_map' what address offset to poke, and it depends on each module.
The next thing that happens in 'map_map' is the "poke writing". Could
you try to write to the 'scale_clear' register next? That would be:
rwdump -a0x33333350 -w16,0
---
In case you would like to look deeper in 'map_map', you can find it in
module/map/map.c around line-number 103. It's not a very complicated
function that does the following:
-) Checks user-mapped memory, you don't need to worry about this, it's
mainly for simulating module memory for tests.
-) Performs the poke-read.
-) Performs the poke-write.
-) If it's a BLT mapping, asks the platform-specific code to do that
without further tests.
-) Otherwise times the poke registers many times to get an idea about
the speed of every single-cycle access.
If you want to dig even deeper, you can look in
module/map/map_xpc_3310.c which is what is used in the most recent Linux
Rio4's. It's mainly a wrapper around a proprietary black-box library, so
not scary and scary at the same time.
Best regards,
Hans
On 2024-02-21 14:32, Weber, Guenter Dr. wrote:
> Dear Hans,
>
>
> with the different register addresses it works.
>
>
> RIO4-MCAL-1 mbsdaq > rwdump -a0x333333fa -r16
> Address=0x333333fa
> Raw-read value=0xfaf5
>
>
> RIO4-MCAL-1 mbsdaq > rwdump -a0x333333fc -r16
> Address=0x333333fc
> Raw-read value=0x083a
>
>
> RIO4-MCAL-1 mbsdaq > rwdump -a0x333333fe -r16
> Address=0x333333fe
> Raw-read value=0x01bc
>
> What can we learn from these numbers?
>
>
>
>
> Best greetings
>
> Günter
>
>
>
> ------------------------------------------------------------------------
> *Von:* Hans Toshihide Törnqvist <hans.tornqvist at chalmers.se>
> *Gesendet:* Mittwoch, 21. Februar 2024 12:43:06
> *An:* Weber, Guenter Dr.
> *Betreff:* Re: [subexp-daq] Report of a possible bug of the CAEN_V560
> module
> Hmm, looks like address offset 0 is "not used", could you try
> -a0x333333fa? Or fe and fc at the end,they should be some read-only
> registers.
>
>
> "Weber, Guenter Dr." <g.weber at hi-jena.gsi.de> skrev: (21 februari 2024
> 12:06:00 CET)
>
> Different VME slot of the V560 module, same result. :-(
>
> ------------------------------------------------------------------------
> *Von:* subexp-daq <subexp-daq-bounces at lists.chalmers.se> im Auftrag
> von Weber, Guenter Dr. <g.weber at hi-jena.gsi.de>
> *Gesendet:* Mittwoch, 21. Februar 2024 11:40:25
> *An:* Hans Toshihide Törnqvist; Discuss use of Nurdlib, TRLO II,
> drasi and UCESB.
> *Betreff:* Re: [subexp-daq] Report of a possible bug of the
> CAEN_V560 module
>
> Dear Hans,
>
>
> the output from manual reading of the module indeed shows a problem:
>
>
> RIO4-MCAL-1 mbsdaq > rwdump -a0x33333300 -r16
> Address=0x33333300
> Raw-read value=rwdump: line 28: 593 Bus error
> $PREFIX $f "$@"
>
>
> The module was working with this address in the other DAQ system (as
> we did not know the order of the individual switches, we set them
> all to "3"). But I can take it our and put it in again at a
> different slot, if maybe this particular slot has a hardware
> problem. (But I never heard of such thing.)
>
>
>
>
> Best greetings
>
> Günter
>
>
> ------------------------------------------------------------------------
> *Von:* Hans Toshihide Törnqvist <hans.tornqvist at chalmers.se>
> *Gesendet:* Mittwoch, 21. Februar 2024 11:14:44
> *An:* Discuss use of Nurdlib, TRLO II, drasi and UCESB.; Weber,
> Guenter Dr.
> *Betreff:* Re: [subexp-daq] Report of a possible bug of the
> CAEN_V560 module
> Dear Günter,
>
> map_map before mapping tries to read and write some given registers
> with a "safe" but slower method of accessing registers, which is
> called "poking" in nurdlib. Maybe the method of access on the rio4
> you have is not safe enough and one of the two pokes fails horribly...
>
> Could you please double check the module address? Could you also try
> using bin/rwdump to read any register in the v560 to see if it's
> accessible at all and not a problem with the module implementation
> in nurdlib?
>
> Something like bin/rwdump -a0x33333300 -r16
>
> Actually the address 0x33333300 looks weird to me, maybe it should
> be 0x33330000?
> Also for reading, try register offsets fa, fc, fe, with 16 bits
> accesseses, they should have some interesting values.
>
> Cheers,
> Hans
>
>
> "Weber, Guenter Dr." <g.weber at hi-jena.gsi.de> skrev: (21 februari
> 2024 10:18:29 CET)
>
> Dear Håkan,
>
>
> thanks for the hint to flush and sleep. Indeed, I now see that
> the crash happens in init_slow of V560 at this line:
>
>
> v560->sicy_map=map_map(v560->address, MAP_SIZE, KW_NOBLT,
> 0, 0, MAP_POKE_ARGS(fixed_code), MAP_POKE_ARGS(scale_clear));
>
>
> Maybe the code is accessing/writing into a memory location that
> it should better not touch?
>
> This problematic line is then followed by:
>
>
> id=MAP_READ(v560->sicy_map, fixed_code);
>
> The corresponding line in the V560 code on the system that was
> running with this module looks like this:
>
>
> v560->sicy_map=map_map(v560->address, MAP_SIZE_MAX(*v560), KW_NOBLT,
> 0, 0, MAP_POKE_ARGS(*v560->read, fixed_code),
> MAP_POKE_ARGS(*v560->write, scale_clear));
>
> And is followed by:
>
>
> mapped_ptr =map_get_mapped_ptr(v560->sicy_map);
> v560->read=mapped_ptr;
> v560->write=mapped_ptr;
>
> Maybe you already have an idea what causes the problem here?
>
>
> I will now go to the system that was running with V560 and make
> a push of the NURDLIB.
>
>
>
>
> Best greetings
>
> Günter
>
>
>
> ------------------------------------------------------------------------
> *Von:* subexp-daq <subexp-daq-bounces at lists.chalmers.se> im
> Auftrag von Håkan T Johansson <f96hajo at chalmers.se>
> *Gesendet:* Dienstag, 20. Februar 2024 20:13:32
> *An:* Discuss use of Nurdlib, TRLO II, drasi and UCESB.
> *Betreff:* Re: [subexp-daq] Report of a possible bug of the
> CAEN_V560 module
>
> Dear Günter,
>
> I took the files you provided and for comparison put them in a
> branch
> 'old_caen_v560'.
>
> git diff origin/old_caen_v560..origin/master
>
> however does not show anything which is suspicious to me.
> Perhaps Hans
> can spot something.
>
> Otherwise, the only idea I can come up with is to continue to
> bisect the
> code inside slow init.
>
> However, before that, I would suggest to add
>
> fflush(stdout); sleep(1);
>
> after each printf statement, such that one can be quite sure
> that the
> printout is not eaten when the RIO crash happens. I.e. that it
> actually
> had gotten further than shown by the prints.
>
> Best regards,
> Håkan
>
>
>
>
> On Tue, 20 Feb 2024, Weber, Guenter Dr. wrote:
>
> >
> > Dear friends,
> >
> >
> > I now had a look at the system where the V560 was running. It was also setup
> > by Bastian. And there the code for the V560 module is slightly different
> > from the one included in the NURDLIB branch that I am using on the test
> > system.
> >
> >
> > Maybe you can have a look at it.
> >
> >
> > I also could push the complete NURDLIB from this system, if this helps.
> >
> >
> >
> >
> > Best greetings
> >
> > Günter
> >
> >
> >
> >
> > ____________________________________________________________________________
> > Von: subexp-daq <subexp-daq-bounces at lists.chalmers.se> im Auftrag von Weber,
> > Guenter Dr. <g.weber at hi-jena.gsi.de>
> > Gesendet: Dienstag, 20. Februar 2024 10:58:27
> > An: Discuss use of Nurdlib, TRLO II, drasi and UCESB.
> > Betreff: [subexp-daq] Report of a possible bug of the CAEN_V560 module
> >
> > Dear friends,
> >
> >
> > I now grabbed a V560 module that was working fine in another DAQ system and
> > put it into our test system.
> >
> >
> > The main.cfg looks like this:
> >
> >
> > log_level=spam # info, verbose, debug, spam
> >
> > CRATE("MCAL") {
> > GSI_VULOM(0x03000000) {
> > timestamp = true # needed to get timestamps in the data output
> > # ecl=0..15
> > }
> > BARRIER
> > CAEN_V560(0x333333300) {
> > use_veto = true
> > }
> > # CAEN_V767A(0x03100000) {
> > # }
> > }
> >
> > Starting the DAQ now results in a freeze of the RIO4. A reset of the crate
> > is necessary to talk to it again.
> >
> >
> > The problem occurs in the first slow init of the V560 module. To find the
> > exact line, I added some output to CRATE.C:
> >
> >
> > 10: crate/crate.c:923: .Slow-init module[0]=GSI_VULOM.
> > before push_log_level(module)
> > before a_crate->module_init_id = module->id
> > before module->props->init_slow(a_crate, module)
> > LOG: TRLO: MD5SUM: 0x1409285e (CT: 63bb1d44 = 2023-01-08 19:45:08 UTC)
> > before module_init_id_mark(a_crate, module)
> > before pop_log_level(module)
> > 10: crate/crate.c:923: .Slow-init module[1]=CAEN_V560.
> > before push_log_level(module)
> > before a_crate->module_init_id = module->id
> > before module->props->init_slow(a_crate, module)
> >
> >
> > The CRATE.C code now looks like this:
> >
> >
> > TAILQ_FOREACH(module, &a_crate->module_list, next) {
> > if (NULL == module->props) {
> > continue;
> > }
> > LOGF(info)(LOGL, "Slow-init module[%u]=%s.", module->id,
> > keyword_get_string(module->type));
> > printf("before push_log_level(module) \n");
> > push_log_level(module);
> > printf("before a_crate->module_init_id = module->id \n");
> > a_crate->module_init_id = module->id;
> > printf("before module->props->init_slow(a_crate, module) \n");
> > if (!module->props->init_slow(a_crate, module)) {
> > printf("before pop_log_level(module) \n");
> > pop_log_level(module);
> > printf("before goto crate_init_done \n");
> > goto crate_init_done;
> > }
> > printf("before module_init_id_mark(a_crate, module) \n");
> > module_init_id_mark(a_crate, module);
> > printf("before pop_log_level(module) \n");
> > pop_log_level(module);
> > }
> >
> > Thus, to me it looks like the check "if (!module->props->init_slow(a_crate,
> > module)) ..." is doing something quite horrible to the RIO4.
> >
> >
> > This is unfortunate, because my original aim was to show that there is also
> > a bug/mistake in readout_dt of the V560 module. But I did not come this far.
> >
> >
> > Do you have any idea what might cause the freezing of the RIO4?
> >
> >
> >
> >
> > Best greetings and many thanks
> >
> > Günter
> >
> >
> >
> >
> >
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.chalmers.se/pipermail/subexp-daq/attachments/20240222/0e6ddc66/attachment-0001.html>
More information about the subexp-daq
mailing list