[subexp-daq] Report of a possible bug of the CAEN_V560 module

Thu Feb 22 10:04:28 CET 2024

Dear friends,

after the bug in map_map was fixed, the freeze does not happen again. Very good!

Now back to my original concern regarding the V560 module ...

readout_dt looks like this:

uint32_t
caen_v560_readout_dt(struct Crate *a_crate, struct Module *a_module)
{
    (void)a_crate;
    LOGF(spam)(LOGL, NAME" readout_dt {");
    a_module->event_counter.value++;
    LOGF(spam)(LOGL, NAME" readout_dt(ctr=0x%08x) }",
        a_module->event_counter.value);
    return 0;
}

The module counter is incremented by one every time readout_dt is executed. This results in a problem in crate.c:

            diff_module = COUNTER_DIFF(*module->crate_counter,
                module->event_counter, module->this_minus_crate);
            /* TODO: Clean this. */
            shadow_counter.value =
                module->shadow.data_counter_value;
            shadow_counter.mask = module->event_counter.mask;
            diff_shadow = COUNTER_DIFF(*module->crate_counter,
                shadow_counter, module->this_minus_crate);

            create_do_shad = !crate_get_do_shadow(a_crate);
            printf("%s: diff_module: %u, module_crate_counter: %u, module_event_counter: %u, module_this_minus_crate: %u \n", keyword_get_string(module->type), diff_module, (*module->crate_counter).value, (module->event_counter).value, module->this_minus_crate);
            if (0 == diff_module &&
                ( create_do_shad ||
                 NULL == module->props->readout_shadow ||
                 0 == diff_shadow)) {
                ok = 1;
                printf("%u \n", ok);
                break;
            }
            getchar();

When the difference between (*module->crate_counter).value and (module->event_counter).value is evaluated the later was already incremented as readout_dt for the module was already executed while the former counter was not incremented.

This is the output:

CAEN_V560: diff_module: 4294967295, module_crate_counter: 0, module_event_counter: 1, module_this_minus_crate: 0

Note that diff_module shows the result of an "0 - 1" operation when working with unsigned integers.

The original version of the crate.c code would now again execute readout_dt of module V560, thus incrementing the module counter another time. Thus, diff_module would be "0 - 2". This would repeat until the timeout condition a bit later in the code is reached.
Then the modules would be re-initialized, thus setting (module->event_counter).value of V560 back to zero. But the crate counter would be incremented. Thus, by shear luck the next try of the same test would have (*module->crate_counter).value and (module->event_counter).value both equal to 1. And from this point the DAQ is running as intended.

Ok, I hope the explanation was clear and I understood correctly what is happening.

Best greetings
Günter

----------------

Günter Weber

Helmholtz-Institut Jena
Fröbelstieg 3
07743 Jena
Germany
Phone: +49-3641-947605
www.hi-jena.de<http://www.hi-jena.de>

GSI Helmholtzzentrum für Schwerionenforschung
Planckstrasse 1
64291 Darmstadt
Germany
www.gsi.de<http://www.gsi.de>
________________________________
Von: subexp-daq <subexp-daq-bounces at lists.chalmers.se> im Auftrag von Weber, Guenter Dr. <g.weber at hi-jena.gsi.de>
Gesendet: Mittwoch, 21. Februar 2024 16:14:45
An: Discuss use of Nurdlib, TRLO II, drasi and UCESB.
Betreff: Re: [subexp-daq] Report of a possible bug of the CAEN_V560 module

Dear Hans,

writing into the register works fine (I tried it several times):

RIO4-MCAL-1 mbsdaq > rwdump -a0x33333350 -w16,0
Address=0x33333350
Raw-write done.
RIO4-MCAL-1 mbsdaq > rwdump -a0x33333350 -w16,0
Address=0x33333350
Raw-write done.
RIO4-MCAL-1 mbsdaq > rwdump -a0x33333350 -w16,0
Address=0x33333350
Raw-write done.
RIO4-MCAL-1 mbsdaq > rwdump -a0x33333350 -w16,0
Address=0x33333350
Raw-write done.

I could now litter map_map with printf() outputs to see where execution of

    v560->sicy_map = map_map(v560->address, MAP_SIZE, KW_NOBLT,
        0, 0, MAP_POKE_ARGS(fixed_code), MAP_POKE_ARGS(scale_clear));

is failing. Should I proceed this way? Or is there anything else that I could check?

(As I understand, the slightly different implementation of V560 on our running system is not indicative of a specific issue, but just due to fact that this is a deprecated version of NURDLIB. Right?)

Best greetings
Günter

________________________________
Von: Hans Toshihide Törnqvist <hans.tornqvist at chalmers.se>
Gesendet: Mittwoch, 21. Februar 2024 15:28:01
An: Discuss use of Nurdlib, TRLO II, drasi and UCESB.; Weber, Guenter Dr.
Betreff: Re: [subexp-daq] Report of a possible bug of the CAEN_V560 module

Dear Günter,

The most important thing is that you get reasonable values with these
reads, the actual values don't mean a whole lot.

One of the manual reads that you did (ofs=0xfa) is what 'map_map' does
for "poke reading". The macros

MAP_POKE_ARGS(fixed_code), or the older
MAP_POKE_ARGS(*v560->read, fixed_code)

tell 'map_map' what address offset to poke, and it depends on each module.

The next thing that happens in 'map_map' is the "poke writing". Could
you try to write to the 'scale_clear' register next? That would be:

rwdump -a0x33333350 -w16,0

---

In case you would like to look deeper in 'map_map', you can find it in
module/map/map.c around line-number 103. It's not a very complicated
function that does the following:

-) Checks user-mapped memory, you don't need to worry about this, it's
mainly for simulating module memory for tests.

-) Performs the poke-read.

-) Performs the poke-write.

-) If it's a BLT mapping, asks the platform-specific code to do that
without further tests.

-) Otherwise times the poke registers many times to get an idea about
the speed of every single-cycle access.

If you want to dig even deeper, you can look in
module/map/map_xpc_3310.c which is what is used in the most recent Linux
Rio4's. It's mainly a wrapper around a proprietary black-box library, so
not scary and scary at the same time.

Best regards,
Hans

On 2024-02-21 14:32, Weber, Guenter Dr. wrote:
> Dear Hans,
>
>
> with the different register addresses it works.
>
>
> RIO4-MCAL-1 mbsdaq > rwdump -a0x333333fa -r16
> Address=0x333333fa
> Raw-read value=0xfaf5
>
>
> RIO4-MCAL-1 mbsdaq > rwdump -a0x333333fc -r16
> Address=0x333333fc
> Raw-read value=0x083a
>
>
> RIO4-MCAL-1 mbsdaq > rwdump -a0x333333fe -r16
> Address=0x333333fe
> Raw-read value=0x01bc
>
> What can we learn from these numbers?
>
>
>
>
> Best greetings
>
> Günter
>
>
>
> ------------------------------------------------------------------------
> *Von:* Hans Toshihide Törnqvist <hans.tornqvist at chalmers.se>
> *Gesendet:* Mittwoch, 21. Februar 2024 12:43:06
> *An:* Weber, Guenter Dr.
> *Betreff:* Re: [subexp-daq] Report of a possible bug of the CAEN_V560
> module
> Hmm, looks like address offset 0 is "not used", could you try
> -a0x333333fa? Or fe and fc at the end,they should be some read-only
> registers.
>
>
> "Weber, Guenter Dr." <g.weber at hi-jena.gsi.de> skrev: (21 februari 2024
> 12:06:00 CET)
>
>     Different VME slot of the V560 module, same result. :-(
>
>     ------------------------------------------------------------------------
>     *Von:* subexp-daq <subexp-daq-bounces at lists.chalmers.se> im Auftrag
>     von Weber, Guenter Dr. <g.weber at hi-jena.gsi.de>
>     *Gesendet:* Mittwoch, 21. Februar 2024 11:40:25
>     *An:* Hans Toshihide Törnqvist; Discuss use of Nurdlib, TRLO II,
>     drasi and UCESB.
>     *Betreff:* Re: [subexp-daq] Report of a possible bug of the
>     CAEN_V560 module
>
>     Dear Hans,
>
>
>     the output from manual reading of the module indeed shows a problem:
>
>
>     RIO4-MCAL-1 mbsdaq > rwdump -a0x33333300 -r16
>     Address=0x33333300
>     Raw-read value=rwdump: line 28:   593 Bus error
>     $PREFIX $f "$@"
>
>
>     The module was working with this address in the other DAQ system (as
>     we did not know the order of the individual switches, we set them
>     all to "3"). But I can take it our and put it in again at a
>     different slot, if maybe this particular slot has a hardware
>     problem. (But I never heard of such thing.)
>
>
>
>
>     Best greetings
>
>     Günter
>
>
>     ------------------------------------------------------------------------
>     *Von:* Hans Toshihide Törnqvist <hans.tornqvist at chalmers.se>
>     *Gesendet:* Mittwoch, 21. Februar 2024 11:14:44
>     *An:* Discuss use of Nurdlib, TRLO II, drasi and UCESB.; Weber,
>     Guenter Dr.
>     *Betreff:* Re: [subexp-daq] Report of a possible bug of the
>     CAEN_V560 module
>     Dear Günter,
>
>     map_map before mapping tries to read and write some given registers
>     with a "safe" but slower method of accessing registers, which is
>     called "poking" in nurdlib. Maybe the method of access on the rio4
>     you have is not safe enough and one of the two pokes fails horribly...
>
>     Could you please double check the module address? Could you also try
>     using bin/rwdump to read any register in the v560 to see if it's
>     accessible at all and not a problem with the module implementation
>     in nurdlib?
>
>     Something like bin/rwdump -a0x33333300 -r16
>
>     Actually the address 0x33333300 looks weird to me, maybe it should
>     be 0x33330000?
>     Also for reading, try register offsets fa, fc, fe, with 16 bits
>     accesseses, they should have some interesting values.
>
>     Cheers,
>     Hans
>
>
>     "Weber, Guenter Dr." <g.weber at hi-jena.gsi.de> skrev: (21 februari
>     2024 10:18:29 CET)
>
>         Dear Håkan,
>
>
>         thanks for the hint to flush and sleep. Indeed, I now see that
>         the crash happens in init_slow of V560 at this line:
>
>
>         v560->sicy_map=map_map(v560->address, MAP_SIZE, KW_NOBLT,
>         0, 0, MAP_POKE_ARGS(fixed_code), MAP_POKE_ARGS(scale_clear));
>
>
>         Maybe the code is accessing/writing into a memory location that
>         it should better not touch?
>
>         This problematic line is then followed by:
>
>
>         id=MAP_READ(v560->sicy_map, fixed_code);
>
>         The corresponding line in the V560 code on the system that was
>         running with this module looks like this:
>
>
>         v560->sicy_map=map_map(v560->address, MAP_SIZE_MAX(*v560), KW_NOBLT,
>         0, 0, MAP_POKE_ARGS(*v560->read, fixed_code),
>         MAP_POKE_ARGS(*v560->write, scale_clear));
>
>         And is followed by:
>
>
>              mapped_ptr =map_get_mapped_ptr(v560->sicy_map);
>         v560->read=mapped_ptr;
>         v560->write=mapped_ptr;
>
>         Maybe you already have an idea what causes the problem here?
>
>
>         I will now go to the system that was running with V560 and make
>         a push of the NURDLIB.
>
>
>
>
>         Best greetings
>
>         Günter
>
>
>
>         ------------------------------------------------------------------------
>         *Von:* subexp-daq <subexp-daq-bounces at lists.chalmers.se> im
>         Auftrag von Håkan T Johansson <f96hajo at chalmers.se>
>         *Gesendet:* Dienstag, 20. Februar 2024 20:13:32
>         *An:* Discuss use of Nurdlib, TRLO II, drasi and UCESB.
>         *Betreff:* Re: [subexp-daq] Report of a possible bug of the
>         CAEN_V560 module
>
>         Dear Günter,
>
>         I took the files you provided and for comparison put them in a
>         branch
>         'old_caen_v560'.
>
>         git diff origin/old_caen_v560..origin/master
>
>         however does not show anything which is suspicious to me.
>         Perhaps Hans
>         can spot something.
>
>         Otherwise, the only idea I can come up with is to continue to
>         bisect the
>         code inside slow init.
>
>         However, before that, I would suggest to add
>
>            fflush(stdout); sleep(1);
>
>         after each printf statement, such that one can be quite sure
>         that the
>         printout is not eaten when the RIO crash happens.  I.e. that it
>         actually
>         had gotten further than shown by the prints.
>
>         Best regards,
>         Håkan
>
>
>
>
>         On Tue, 20 Feb 2024, Weber, Guenter Dr. wrote:
>
>         >
>         > Dear friends,
>         >
>         >
>         > I now had a look at the system where the V560 was running. It was also setup
>         > by Bastian. And there the code for the V560 module is slightly different
>         > from the one included in the NURDLIB branch that I am using on the test
>         > system.
>         >
>         >
>         > Maybe you can have a look at it.
>         >
>         >
>         > I also could push the complete NURDLIB from this system, if this helps.
>         >
>         >
>         >
>         >
>         > Best greetings
>         >
>         > Günter
>         >
>         >
>         >
>         >
>         > ____________________________________________________________________________
>         > Von: subexp-daq <subexp-daq-bounces at lists.chalmers.se> im Auftrag von Weber,
>         > Guenter Dr. <g.weber at hi-jena.gsi.de>
>         > Gesendet: Dienstag, 20. Februar 2024 10:58:27
>         > An: Discuss use of Nurdlib, TRLO II, drasi and UCESB.
>         > Betreff: [subexp-daq] Report of a possible bug of the CAEN_V560 module
>         >
>         > Dear friends,
>         >
>         >
>         > I now grabbed a V560 module that was working fine in another DAQ system and
>         > put it into our test system.
>         >
>         >
>         > The main.cfg looks like this:
>         >
>         >
>         > log_level=spam # info, verbose, debug, spam
>         >
>         > CRATE("MCAL") {
>         >     GSI_VULOM(0x03000000) {
>         >         timestamp = true # needed to get timestamps in the data output
>         >     #   ecl=0..15
>         >     }
>         >     BARRIER
>         >     CAEN_V560(0x333333300) {
>         >         use_veto = true
>         >     }
>         > #   CAEN_V767A(0x03100000) {
>         > #   }
>         > }
>         >
>         > Starting the DAQ now results in a freeze of the RIO4. A reset of the crate
>         > is necessary to talk to it again.
>         >
>         >
>         > The problem occurs in the first slow init of the V560 module. To find the
>         > exact line, I added some output to CRATE.C:
>         >
>         >
>         > 10: crate/crate.c:923: .Slow-init module[0]=GSI_VULOM.
>         > before push_log_level(module)
>         > before a_crate->module_init_id = module->id
>         > before module->props->init_slow(a_crate, module)
>         > LOG: TRLO: MD5SUM: 0x1409285e (CT: 63bb1d44 = 2023-01-08 19:45:08 UTC)
>         > before module_init_id_mark(a_crate, module)
>         > before pop_log_level(module)
>         > 10: crate/crate.c:923: .Slow-init module[1]=CAEN_V560.
>         > before push_log_level(module)
>         > before a_crate->module_init_id = module->id
>         > before module->props->init_slow(a_crate, module)
>         >
>         >
>         > The CRATE.C code now looks like this:
>         >
>         >
>         >     TAILQ_FOREACH(module, &a_crate->module_list, next) {
>         >         if (NULL == module->props) {
>         >             continue;
>         >         }
>         >         LOGF(info)(LOGL, "Slow-init module[%u]=%s.", module->id,
>         >             keyword_get_string(module->type));
>         >         printf("before push_log_level(module) \n");
>         >         push_log_level(module);
>         >         printf("before a_crate->module_init_id = module->id \n");
>         >         a_crate->module_init_id = module->id;
>         >         printf("before module->props->init_slow(a_crate, module) \n");
>         >         if (!module->props->init_slow(a_crate, module)) {
>         >             printf("before pop_log_level(module) \n");
>         >             pop_log_level(module);
>         >             printf("before goto crate_init_done \n");
>         >             goto crate_init_done;
>         >         }
>         >         printf("before module_init_id_mark(a_crate, module) \n");
>         >         module_init_id_mark(a_crate, module);
>         >         printf("before pop_log_level(module) \n");
>         >         pop_log_level(module);
>         >     }
>         >
>         > Thus, to me it looks like the check "if (!module->props->init_slow(a_crate,
>         > module)) ..." is doing something quite horrible to the RIO4.
>         >
>         >
>         > This is unfortunate, because my original aim was to show that there is also
>         > a bug/mistake in readout_dt of the V560 module. But I did not come this far.
>         >
>         >
>         > Do you have any idea what might cause the freezing of the RIO4?
>         >
>         >
>         >
>         >
>         > Best greetings and many thanks
>         >
>         > Günter
>         >
>         >
>         >
>         >
>         >
>         >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.chalmers.se/pipermail/subexp-daq/attachments/20240222/0e6ddc66/attachment-0001.html>