Diving Deep into the CTS256A-AL2 Firmware [message #10847]
Wed, 09 October 2024 21:07
jayindallas
DISCUSSION: Diving Deep into the CTS256A-AL2 Firmware
CONTENTS:
1). CTS Code to Say "OK"
2). An Introduction to "Register Files" (CPU RAM)
3). For Efficiency and Speed, Use 8-bit Buffer Constructs Where Possible
4). An Introduction to Assembling TMS7000 Machine Code
5). Say "OK" a Faster Way
6). Next Time... An introduction to the ZIP_File_01.zip, A useful collection...
7). Coming Soon... Using TRAP n instructions to save 122 bytes of CTS codespace.
CTS Code to Say "OK"
When you power up a CTS256A-AL2 CODE-TO-SPEECH processing chip wired to the SP0256 Narrator(tm) Speech Processor, it silently initializes and, if everything is working, the 'CTS' makes the 'SP0' voice-synthesize "O-K". As that's the first thing we hear, it's an interesting and easy place to start looking at the original internal CTS ROM code.
ADDR  MCODE    Time ifJP LINE# LABEL OPCODE OPERANDS   BYTES REALTIME     Line by line description
===== ======== ==== ==== ===== ===== ====== ========== ===== ============ ========================
                                                        ___ ____________
F1A8: 4F2D4B   -    -    242:  STROK TEXT   'O-K'      | 3B|         0Tc| The 'O-K' data takes 3 bytes, but zero time
F1AB: 0D       -    -    243:        BYTE   >0D        | 1B|         0Tc| The 'carriage return' takes 1 byte, zero time
F1AC: 73F90A   9Tc  -    245:  SAYOK AND    %>F9,R10   | 3B| 1x09=   9Tc| Sets up the buffer flags (AND >F9 clears bits 1 and 2 of R10)
F1AF: C5       5Tc  -    246:        CLR    B          | 1B| 1x05=   5Tc| Clear Register B, initialize the loop counter (0:4)
F1B0: AA1000   13Tc -    248:  LF1B0 LDA    @STROK(B)  | 3B| 4x13=  52Tc| 4x: get next B-indexed character, put in Register A
F1B3: 8EF1E2   14Tc -    249:        CALL   @STINPB    | 3B| 4x14=  56Tc| 4x: call subroutine to store A into the input buffer
F1B6: C3       5Tc  -    250:        INC    B          | 1B| 4x05=  20Tc| 4x: bump B to point to the next char
F1B7: 5D04     7Tc  -    251:        CMP    %>04,B     | 2B| 4x07=  28Tc| 4x: test loop counter, if 4: DONE (zero flag updated)
F1B9: E6F5     5Tc  07Tc 252:        JNZ    LF1B0      | 2B| 3x07=  21Tc| 3x: JNZ_/ jump back to loop on counter = {1,2,3}
                               (+jnz drop through)     | - | 1x05=   5Tc| 1x: JNZ \ drop through on counter = {4}
                                                       |===|       =====|
                                                       |19B|       196Tc| TOTAL RUN-TIME
                                                       |___|____________| (not including subroutine duration)
The code above is easy to understand because it's similar to other CPUs' instructions, and the descriptions in the right margin help too. The 'TOTAL RUN-TIME' of the code is summed in the box. Example: '4x13= 52Tc' means this instruction is executed 4 times in the loop construct. Each execution takes 13Tc units of time, so the total for that instruction is 52Tc.
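If you want to double-check the box totals, here is a small Python sketch (not part of the CTS code). The byte counts, cycle counts and repeat counts are copied from the listing; the tuple layout and names are just my own bookkeeping.

```python
# Each row: (instruction, bytes, Tc per execution, times executed).
# All numbers are copied from the listing; the layout is only for illustration.
ROWS = [
    ("TEXT 'O-K'",         3,  0, 0),
    ("BYTE >0D",           1,  0, 0),
    ("AND %>F9,R10",       3,  9, 1),
    ("CLR B",              1,  5, 1),
    ("LDA @STROK(B)",      3, 13, 4),
    ("CALL @STINPB",       3, 14, 4),
    ("INC B",              1,  5, 4),
    ("CMP %>04,B",         2,  7, 4),
    ("JNZ (jump taken)",   2,  7, 3),
    ("JNZ (drop through)", 0,  5, 1),  # same instruction, so no extra bytes
]

total_bytes = sum(size for _, size, _, _ in ROWS)
total_tc = sum(tc * count for _, _, tc, count in ROWS)
print(f"{total_bytes}B, {total_tc}Tc")  # 19B, 196Tc
```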
Tc is a shortened version of Tc(C), the internal state cycle period: how long one cycle takes based upon (1) the frequency of the crystal or external clock source and (2) the CTS chip's internal divide-by-2 or divide-by-4 masked circuit. The Data Manual offers this example: with a 5 MHz crystal and a divide-by-2 internal circuit, the internal frequency is 2.5 MHz. The period of the internal frequency is its reciprocal, 400 ns per Tc(C). In the code above, 196Tc * 400 ns = 78.4 us (microseconds).
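The same arithmetic as a quick Python sketch, in case you run a different crystal. The function name and parameters are my own; the 5 MHz / divide-by-2 numbers are the Data Manual example quoted above.

```python
def state_cycle_ns(crystal_hz, divider):
    """Period of one internal state cycle, Tc(C), in nanoseconds."""
    internal_hz = crystal_hz / divider
    return 1e9 / internal_hz

tc_ns = state_cycle_ns(5_000_000, 2)       # Data Manual example
run_time_us = 196 * tc_ns / 1000           # 196Tc from the listing
print(tc_ns, run_time_us)

# The same crystal with a divide-by-4 part would double the period.
tc_div4_ns = state_cycle_ns(5_000_000, 4)
```

With the divide-by-2 example this gives 400 ns per Tc and 78.4 us for the 196Tc routine.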
Line# 248 is the most unusual instruction in the code. It loads Register A with a value from an addressed table (STROK), using Register B as an index or offset relative to the table address, to get the correct data/character. While we don't know the specifics of why Line# 245 used a particular value, nor what subroutine STINPB* does beyond sticking a single character into the input buffer, we can understand this simple piece of code. *STINPB will be covered in a later posting about CTS Buffers.
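For anyone who thinks better in a high-level language, here is a rough Python model of the loop. It is only a sketch: the dictionary standing in for ROM and the stinpb() function are my stand-ins, and the real STINPB does more than append to a list.

```python
STROK = 0xF1A8
# Stand-in for the four ROM bytes at STROK: 'O', '-', 'K', carriage return.
ROM = {STROK + i: byte for i, byte in enumerate(b"O-K\r")}

input_buffer = []

def stinpb(a):
    """Hypothetical stand-in for STINPB: just records the character in A."""
    input_buffer.append(a)

b = 0                        # CLR B
while True:
    a = ROM[STROK + b]       # LDA @STROK(B): A = byte at address STROK + B
    stinpb(a)                # CALL @STINPB
    b = (b + 1) & 0xFF       # INC B (8-bit register)
    if b == 4:               # CMP %>04,B sets zero flag; JNZ drops through
        break

print(bytes(input_buffer))   # b'O-K\r'
```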
An Introduction to "Register Files"
Line# 245 does a logical AND operation on the contents of R10. That is how most of the CPU internal RAM, called the "Register File", is referenced. The original CTS chip has 128 bytes of RAM, known as R0 through R127 using decimal numbering. Registers A and B are R0 and R1, respectively. They are used by instructions intended to execute faster; these often use a set of opcode bytes that are micro-coded to use Register A or B, instead of spending a full byte on another operand. Reducing operands is one way to speed up instructions: one less fetch.
An interesting feature of the TMS7000 CPU is that any two consecutive registers (Rp-1,Rp), such that Rp is not R0, can also be used as a 16-bit value or pointer at any time. A 16-bit register pair such as R5:6 (MSB:LSB) is designated by its least significant byte (LSB), R6. For example, the third line of code executed upon a hardware RESET is:
F003: 8820002D 14: MOVD %>2000,R45 ; loads the external address of the 'SP0' into register pair R44:45
The opcode is MOVD (move double, i.e. a 16-bit value); it takes an immediate value of hex 2000 and stores it into register pair R44:45 (MSB:LSB). There is no rule about the LSB being ODD or EVEN. The only case they warn about is something like "MOVD %>2000,R0", as the MSB would be Rp-1, or register negative 1. That would conceptually wrap around to R127:0, and the CPU may or may not support the wrap-around situation.
The CTS code has ODD and EVEN register pairs:
F0F8: 9C31 152: BR *R49 ; ODD, Branch (jump) to a 16-bit address stored in R48:49
F170: DB34 212: DECD R52 ; EVEN, 16-bit decrement instruction on the value in R51:52 (no 16b INCD)
F1D4: 9A2F 269: LDA *R47 ; ODD, Reads the parallel port input using the address in R46:47
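To make the pair naming concrete, here is a little Python model of the Register File. It is a sketch, not an emulator: the class and method names are mine, and refusing R0 as the LSB is my choice for the model, since we don't know what the real CPU does in that case.

```python
class RegisterFile:
    """Toy model of the 128-byte internal RAM, R0..R127."""

    def __init__(self, size=128):
        self.r = bytearray(size)

    def movd_imm(self, value, rn):
        """MOVD %>value,Rn: MSB goes to R(n-1), LSB goes to Rn."""
        if rn == 0:
            # Model choice only: real hardware behavior here is unknown to me.
            raise ValueError("R0 has no R(n-1) partner")
        self.r[rn - 1] = (value >> 8) & 0xFF
        self.r[rn] = value & 0xFF

    def pair(self, rn):
        """Read the 16-bit value held in R(n-1):Rn."""
        return (self.r[rn - 1] << 8) | self.r[rn]

    def decd(self, rn):
        """DECD Rn: 16-bit decrement of the pair R(n-1):Rn."""
        self.movd_imm((self.pair(rn) - 1) & 0xFFFF, rn)

rf = RegisterFile()
rf.movd_imm(0x2000, 45)                  # MOVD %>2000,R45
print(hex(rf.r[44]), hex(rf.r[45]))      # 0x20 0x0
```

Odd or even makes no difference to the model either: pair(45) and pair(52) work the same way.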
The last thing to say about CPU internal RAM is that >00xx is its 16-bit address range. The CPU can access internal RAM and control ports using a 16b pointer: LDA *R7 or STA *R7. Whether a given TMS7000 CPU has a block of 128 or 256 RAM bytes, the upper address byte is always >00 when addressing internal RAM through a 16-bit pointer. This means you could use 8b buffer constructs and just keep the MSB of the pointer cleared to zero.
For Efficiency and Speed, Use 8-bit Buffer Constructs Where Possible
The algorithm that converts Text-to-Voice needs more than a FIFO buffer. When focus starts on one letter, the buffer is scanned for letter groupings that might be a pattern for one group allophone. This means more time is spent in the Text-to-Voice buffer, and some scans are wasted effort until the code can abandon looking further. Because of these repeated scans for letter-pattern scenarios, this buffer should be carefully constructed for efficiency and speed. Whether the special buffer runs in internal RAM or external RAM, it can be constructed to use fewer than 256 locations. Separating the algorithm buffer into its own 16 or 32 bytes would allow it to run quickly. All the math is then 8-bit, and the buffer boundaries can be enforced by incrementing or decrementing the 8-bit pointers and using a logical AND to remove any out-of-range values, which gives an automatic wrap-around in the buffer, as one example.
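Here is the AND-mask idea as a Python sketch. The 32-byte size and the names are assumptions for illustration; the trick needs a power-of-two buffer size so that the mask is all ones in the low bits.

```python
BUF_SIZE = 32            # assumed size; must be a power of two for the AND trick
MASK = BUF_SIZE - 1      # >1F for a 32-byte buffer

buf = bytearray(BUF_SIZE)

def next_index(i):
    """INC then AND: index 31 wraps to 0 with no compare or branch."""
    return (i + 1) & MASK

def prev_index(i):
    """DEC then AND: index 0 wraps to 31 (>FF AND >1F = >1F)."""
    return (i - 1) & MASK

head = 0
for byte in b"THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG":  # 43 characters
    buf[head] = byte
    head = next_index(head)

print(head, prev_index(0))
```

After 43 writes the index has wrapped once and sits at 11, with no 16-bit math anywhere.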
The CTS implementation uses 16-bit buffers and has a flag table to configure the acrobatic algorithm scans described above. The buffer code has to read those flags again each time it is asked to do something. It's a big overhead and slows things down. When there is no designated external RAM, the two buffers are created in internal RAM. The code still uses a 16-bit construct (because it might run a big buffer in external RAM with the same code, if it's configured?). The method is to bump a 16-bit pointer, then compare it to the over-range value held in a 16-bit register pair. If they are equal, the buffer needs to wrap around, and another stored 16-bit value is written over the pointer to make it do so. That is too much activity when a simpler scheme would be easier to code.
To make matters worse, the CTS code in one of the pattern testing routines uses the multiply instruction, which runs in the range of 44Tc to 49Tc (MPY %>02,B runs in 46Tc), whereas the longest non-multiply instruction is 17Tc. A x2 multiplication can be done in binary math as an 'RLC B' at 5Tc if the Carry flag is already cleared. Worst case is 'CLRC' @ 6Tc + 'RLC B' @ 5Tc = 11Tc.
In fairness to the original CTS coders, they had a schedule to get the code working so they could sell the CTS. They accomplished that requirement. In addition, an 'interesting' instruction, BR LABEL(B), is poorly documented in all three versions of the CPU manual. The CTS coders hacked a routine that worked and moved on, using a x3 to jump to a table of 3-byte BR (Branch) instructions. It works.
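For comparison, the bump/compare/reload method looks like this in Python. This is a sketch of the method as described, not a transcription of the CTS routine; the buffer addresses are hypothetical.

```python
def bump_pointer(ptr, start, limit):
    """Advance a 16-bit pointer; reload `start` when it reaches `limit`."""
    ptr = (ptr + 1) & 0xFFFF       # 16-bit increment
    if ptr == limit:               # 16-bit compare against the over-range value
        ptr = start                # overwrite the pointer to wrap around
    return ptr

START, LIMIT = 0x0050, 0x0070      # hypothetical 32-byte buffer in internal RAM
ptr = START
for _ in range(43):
    ptr = bump_pointer(ptr, START, LIMIT)
print(hex(ptr))
```

It lands in the same place as an 8-bit masked index would (offset 11 after 43 steps), but on the CPU every step costs a 16-bit increment, a 16-bit compare and a conditional reload.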
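A quick Python check that a rotate-left-through-carry with Carry cleared really is a x2 for every 8-bit value. The rlc() function is my own model of the instruction; the Tc figures are the ones quoted above.

```python
def rlc(value, carry):
    """Rotate an 8-bit value left through carry; returns (result, carry_out)."""
    result = ((value << 1) | carry) & 0xFF
    carry_out = (value >> 7) & 1
    return result, carry_out

# With carry cleared first (CLRC), carry_out:result equals value * 2.
for b in range(256):
    doubled, carry = rlc(b, 0)
    assert (carry << 8) | doubled == b * 2

MPY_TC = 46            # MPY %>02,B
SHIFT_TC = 6 + 5       # CLRC + RLC B, worst case
print(MPY_TC - SHIFT_TC, "Tc saved per x2")
```

The bit that falls out of the top ends up in Carry, so nothing is lost if the code needs the ninth bit.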
In 2024, we have different project constraints and anyone can look at the code and modify it.
This would be a good modification of codespace for 2024+. :)
An Introduction to Assembling TMS7000 Machine Code:
Take another look at the F003: machine code. It illustrates the general way an instruction is assembled into machine code.
F003: 8820002D 14: MOVD %>2000,R45
      ^^^^^^^^     ^^^^   ^^^^ ^^^
The first byte of the machine code of an instruction is the opcode. The TMS7000 CPU has several instructions whose opcodes each work for one particular operand syntax. MOVD has three unique opcodes, one for each operand syntax. For example:
Opcode  Mnemonic  Syntax       Bytes  Operands        Machine Code  Comments
88      MOVD      %>iop,Rd     4      16b_iop,8b_dst  88.mm.ll.dd   --.mm.ll.-- are the 16b immediate operand in MSB.LSB order
A8      MOVD      %>iop(B),Rd  4      16b_iop,8b_dst  A8.mm.ll.dd   --.--.--.dd is the 8b hex value of the Rd decimal value.
98      MOVD      Rs,Rd        3      8b_src,8b_dst   98.ss.dd      --.ss.dd are 2 8b hex values of the Rs,Rd decimal values.
From an assembler perspective, you identify the operand field syntax to determine which instruction opcode applies. Then you place the operands in the next bytes of machine code in the order they are listed in the operand field, left to right. Be aware that references to Register A or B are usually designated in the opcode byte, so they are generally *NOT* in an additional operand byte of machine code. That keeps it simple and faster. One exception is an assembler instruction that moves values between A and B, but it's technically not an instruction; it's of the syntax A,Rx or B,Rx, and the assembler is expected to convert the destination Register Name (A or B) to its associated Rx name (R0 or R1 respectively). If I write an assembler I'll dig up and tabulate those exceptions.
However, note that the "%>iop(B)" operand does NOT need to specify Reg B (R1) in the machine code because (1) there are not enough bytes in the machine code to explicitly reference B, and (2) the A8 opcode syntax makes it an implied operand. Opcode A8 uses B; there is nothing variable about that. So the machine code A8.mm.ll.dd has no room for A or B explicitly; that would require A8.mm.ll.ss.dd, where ss is 01 for R01, aka Register B.
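The three MOVD forms can be hand-assembled with a few lines of Python. This is only a sketch of the method in the table; the function names are mine and it covers MOVD only, not a general assembler.

```python
def movd_imm(iop, rd):
    """MOVD %>iop,Rd -> 88.mm.ll.dd"""
    return bytes([0x88, (iop >> 8) & 0xFF, iop & 0xFF, rd])

def movd_indexed(iop, rd):
    """MOVD %>iop(B),Rd -> A8.mm.ll.dd (B is implied by the opcode)"""
    return bytes([0xA8, (iop >> 8) & 0xFF, iop & 0xFF, rd])

def movd_reg(rs, rd):
    """MOVD Rs,Rd -> 98.ss.dd"""
    return bytes([0x98, rs, rd])

# Registers are written in decimal in the source and in hex in the machine code.
print(movd_imm(0x2000, 45).hex().upper())   # 8820002D
```

That reproduces the F003: machine code, with decimal R45 becoming hex 2D in the last byte.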
There is a file in the ZIP_File_01.zip that has these machine code constructs: The_Method_of_Assembling_Machine_Code.lst
When this makes sense to you, you'll have a natural method to assemble machine code without looking through the tedious explanations, mostly in the 1983 TMS7000 manual. If you want to write an assembler, then I suggest you read the manual too.
Yes, I used "Rs,Rd" instead of the manual's "Rp,Rp" for register pair. This adds clarity that the source is always/usually listed first and the latter operands tend to be destinations. Using "Rp,Rp" doesn't help the novice know which register is which.
Say "OK" a Faster Way:
Being a fan of the Z80, I started to write a TMS7000 "DJNZ version" using Reg B as both the counter and the index. I was surprised when I used Linux GREP to see what instructions the CTS coders used: they didn't use any DJNZ!?!? However, looking at the loop's "4x...=" over and over, I thought I should try a straight drop-through LOAD&CALL routine for speed and length comparison. I wrote the following code with a TRAP n instruction, which I'll explain later; for now... think of it as a short-cut, 1-byte CALL that isn't any faster than a 3-byte CALL, but that allows you to recover a lot of codespace when you have subroutines called from many places in the code.
ADDR  MCODE    Time ifJP LINE# LABEL OPCODE OPERANDS   BYTES REALTIME     Line by line description
===== ======== ==== ==== ===== ===== ====== ========== ===== ============ ========================
                                                        ___ ____________
F1A8: 73F90A   9Tc  -    245:  SAYOK AND    %>F9,R10   | 3B| 1x09=   9Tc| set some flags for the buffer...
F1AB: 224F     7Tc  -    246:        MOV    %>4F,A     | 2B| 1x07=   7Tc| put an "O" into Register A
F1AD: EE       14Tc -    247:        TRAP   6          | 1B| 1x14=  14Tc| TRAP 6 is a 1-byte CALL to STINPB
F1AE: 222D     7Tc  -    248:        MOV    %>2D,A     | 2B| 1x07=   7Tc| put a "-" into Register A
F1B0: EE       14Tc -    249:        TRAP   6          | 1B| 1x14=  14Tc| TRAP 6 is a 1-byte CALL to STINPB
F1B1: 224B     7Tc  -    250:        MOV    %>4B,A     | 2B| 1x07=   7Tc| put a "K" into Register A
F1B3: EE       14Tc -    251:        TRAP   6          | 1B| 1x14=  14Tc| TRAP 6 is a 1-byte CALL to STINPB
F1B4: 220D     7Tc  -    252:        MOV    %>0D,A     | 2B| 1x07=   7Tc| put an ASCII 'CR' character into Reg A
F1B6: EE       14Tc -    253:        TRAP   6          | 1B| 1x14=  14Tc| TRAP 6 is a 1-byte CALL to STINPB
...                                                    |   |            |
FFF2:                                DATA   STINPB     |+2B|            | CALL address in table, +2B
                                                       |===|       =====|
                                                       |17B|        93Tc| TOTAL RUN-TIME (Twice as fast)
                                                       |___|____________| (not including subroutine duration)
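As a small preview of the TRAP bookkeeping, here is a Python sketch. The vector formula is my reading of the TMS7000 trap table (TRAP 0 sharing the reset vector at >FFFE, two bytes per entry going downward); it agrees with the FFF2: entry in the listing, but treat it as an assumption until the TRAP posting. The function names are mine.

```python
def trap_vector_address(n):
    """Assumed address of the 2-byte vector for TRAP n (TRAP 0 at >FFFE)."""
    return 0xFFFE - 2 * n

def bytes_saved(call_sites):
    """Codespace saved by turning 3-byte CALLs into 1-byte TRAPs.

    Each call site shrinks by 2 bytes; the vector entry costs 2 bytes once.
    """
    return 3 * call_sites - (1 * call_sites + 2)

print(hex(trap_vector_address(6)))   # 0xfff2
print(bytes_saved(4))                # 6
```

So the four calls in this routine alone save 6 bytes, and the saving grows by 2 bytes for every further place in the code that calls the same subroutine.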
Note: During initialization, execution speed is less important. When initialization is completed and the CPU has some real-time constraints to perform its function, then speed coupled with efficient algorithms and processing is important.
LESSON??? Could it be... Sometimes the best Real-Time code is written like straight-line BASIC!?!? In a way... YES. I'd re-phrase that to say the better lesson is to be careful of your assumptions. Spend a little time to verify them. When the unexpected is the best solution, it's good to be the one who figured that out.
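In that spirit, here is a Python sketch that checks the two versions against each other for a string of n characters. The formulas are my generalization of the two listings (same Tc and byte counts per instruction, TRAP vector charged as 2 bytes), so treat the numbers for n other than 4 as estimates.

```python
def loop_version(n):
    """(bytes, Tc) for the indexed-loop routine sending n characters."""
    size = 15 + n                     # 15 bytes of code + n bytes of text
    tc = 9 + 5                        # AND and CLR B, once each
    tc += n * (13 + 14 + 5 + 7)       # LDA, CALL, INC, CMP per character
    tc += (n - 1) * 7 + 5             # JNZ taken n-1 times, then drops through
    return size, tc

def straight_line_version(n):
    """(bytes, Tc) for the MOV + TRAP routine sending n characters."""
    size = 3 + 3 * n + 2              # AND, then MOV+TRAP per char, + vector
    tc = 9 + n * (7 + 14)             # AND once, then MOV and TRAP per char
    return size, tc

for n in (4, 5, 6):
    print(n, loop_version(n), straight_line_version(n))
```

For n = 4 this reproduces both boxes: (19, 196) and (17, 93). The straight-line code stays roughly twice as fast as the string grows, but under these assumptions it ties on size at 5 characters and is bigger from 6 onward, so the "best" answer depends on the string length.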
Next Time...
An introduction to the ZIP_File_01.zip, A useful collection of Tables, Charts and Notes about CTS and TMS7000 CPU.
Coming Soon...
How switching the most used subroutines CALLs to TRAP n instructions saves 122 bytes of CTS codespace.
[Updated on: Tue, 11 March 2025 11:05]