1.Introduction Giventheincreasingdemandformunications,cryptographicalgorithmswillbeembeddedinalmosteveryapplicationinvolvingexchangeofinformation. Modulararithmeticoperationssuchasdivisionandmultiplicationoverfinitefields(mostlyGF(p)andGF(2n)),arewidelyusedinseveralcryptographicapplications.Ingeneral,public-keycryptographicalgorithmsrequiresahugeamountputations,therefore,thereisanincreasingdemandfordedicatedhardwaretoelerateputations.AnalgorithmandhardwareimplementationthatisableputemodulardivisionandmultiplicationinbothGF(p)andGF(2n)isdefinitelyadvantageous. Themodularmultiplicationoperationwithalargemodulusisveryimportantinmanypublic-keycryptosystems.Montgomerymultiplicationalgorithm[11]isanefficientwayputemodularmultiplication.Authorsin[5]proposedaunifiedalgorithmputeMontgomerymultiplicationinbothGF(p)andGF(2n).Ontheotherhand,modulardivisionandmodularinversearetimeconsumingoperationsandtheycannotbepletelyinpracticalapplications.Forinstance,theyareusedindecryptionofElGamal[7]public-keycryptosystem.Researchershaveclaimedthatagaininperformancecanbeobtainedwhentheyareimplementedinhardware[1,6]. TheUnifiedmodularDivision/MultiplicationAlgorithm(UDMA)proposedinthisputesmodulardivisionandMontgomerymultiplicationinbothGF(p)andGF(2n)fields.TheUDMAisbasedontheExtendedBinaryGCDalgorithm[2,12]for modulardivisionandontheMontgomery’smethodformodularmultiplication.Forthedivisionoperation,theproposedalgorithmusesacountertokeeptrackofthedifferencebetweenintegersinGF(p)orpolynomialdegreesinGF(2n),reducingplexityofeachalgorithmiterationbyeliminatingtheneedparisonsoffieldelements.Othermodulardivision(inverse)algorithmshaveintegerandpolynomialparisons[6,8,9]aspartoftheircontrolflow. Thehardwarearchitecturefortheunifieddivider/multiplierthatimplementstheUDMAefficientlysupportsallputationsinthealgorithmandusesCarrySaverepresentationsothatalladditionsandsubtractionsaredoneinconstanttime,independentoftheoperandsprecision. ThefollowingSectionpresentssomegeneralconceptsandmathematicalnotationusedinthispaper.TheUnifiedmodularDivision/MultiplicationAlgorithm(UDMA)isproposedinSection3.TheanizationofthehardwaredesignthatimplementstheUDMAanditsmainfunctionalblocksareshownanddescribedindetailinSection4.ExperimentalresultsareshowninSection5,followedbyconclusionsinSection6.
2.GeneralConceptsandNotation TheelementY(x)∈GF(2n)isapolynomialofdegreelessthannandhascoefficients inGF
(2).Inthispaper,weconsiderpolynomialbasisfortheelementsofGF(2n).On theotherhand,theelementY∈GF(p)isanintegerintheset{1,...,p−1},wherepisa n-bitprimemodulusintherange2n−1
X,
Y,
p
and
Field.
The
computation
Z
=
XY
mod
p
is
performed
when
Field
=
GF(p),
and
Z(x)
=
X(x)Y(x)
modp(x)isdonewhenField=GF(2n).
In
bothcases,Y=
0.ThemodularinversecanbeobtainedbyforcingX=
1.Noticethat
thedivisionorinversetakesthesameamountoftime.
•UnifiedMontgomeryModularMultiplicationAlgorithm:Montgomeryintroducedan
efficientalgorithmtocalculatemodularmultiplication[11].AMontgomeryalgorithm
tohandlemultiplicationinGF(p)andGF(2n)finitefieldswasintroducedin[5].The
Montgomerymultiplicationalgorithmgeneratestheproductoftwon-bitintegersXand
Yinmodulopas:MM(
X,Y)=XYr−1modp,wherer=2nfor2n−1
3.UnifiedAlgorithmforModularDivisionandMontgomeryModularMultiplication. TheUnifiedmodularDivision/MultiplicationAlgorithm(UDMA)isstatedasfollows: Function:ModularDivisionandMultiplicationinGF(p)andGF(2n) Inputs:0≤X
Z
=XY2−n
mod
p
when
Op=mult,
Z
=
XY
mod
p
when
Op=div.
Algorithm:
C=
Y.
IFOp=multTHEN
/*MultiplicationMode*/
D=
0,U=
0,W=
X,δ=n
ELSE
/*DivisionMode*/
D=p,U=
X,W=
0,δ=
0
ENDIF;
WHILE[(C=0ANDOp=div)OR(δ=0ANDOp=mult)]
IFc0=0THEN
C:=C1
δ:=δ−
1
/*IntegerOperation*/
ELSE
k=
1
IF(Op=div)THEN
IFδ<0THENC⇔
D,U⇔
W,δ:=−δENDIF;/*Swapping*/
IF((C+D)mod4=0ANDField=GF(p))THENk=−
1
ELSEδ:=δ−1ENDIF;
ELSE
/*Op=mult*/
δ:=δ−
1
ENDIF;
C:=(C+k∗D)
1,U:=(U+k∗W)
ENDIF;
U:=(U+u0∗p)
1
ENDWHILE;
IFOp=divTHENZ:=
W
ELSEZ:=
U
ENDIF;
Tothebestofourknowledge,thisisthefirstalgorithmthatintegratesputationofmodulardivisionandMontgomerymodularmultiplicationinbothGF(p)andGF(2n).TheUDMAmodeofoperationiscontrolledbyinputOp(divormult),andthefinitefieldiscontrolledbytheinputField(GF(p)orGF(2n)).Forsimplicity,thepolynomialsX(x),Y(x),andp(x)aredenotedasX,Y,andp,respectively,whichcorrespondtothebit-vectorrepresentationofthesepolynomials.
Mostofthearithmeticoperationsinthealgorithmmontobothmodesofoperation.Theinitializationofvariablesdependsontheoperation.Foragivenfield,alltheadditions/subtractionsaredoneinthefield,besidesthearithmeticoperationsonδ(subtractionsandchangeofsign)whicharealwaysintegeroperations.
ThealgorithmintegratestheExtendedBinaryGCDalgorithmandtheMontgomerymultiplicationalgorithmanditwasverifiedusingMaple.puteMontgomerymultiplicationusingann-bitmodulusp,UDMAperformsniterations.Thecounterδisinitializedwithvaluen,andineachiterationitisdecrementedbyone.Thevariablesusedinthealgorithmareinitializedas:C=
Y,D=
0,U=0,andW=
X.Theresultisready(Z=U)whenδ=
0.ThepartialproductUisreducedmodpineachiteration.Inbothfields,whilemultiplying,additionisusedintheoperationsthatupdateCandD(k=1).Theoperatorindicatesa1-bitrightshiftoperation.
TheputesmodulardivisionusingthesamestructureusedbytheExtendedBinaryGCDAlgorithmformodulardivision[2].Thevariablesareinitializedas:C=
Y,D=p,U=
X,W=0,andδ=
0.IfthedivisionputedinGF(p),UDMAteststheleastsignificanttwobitsofCandD((C+D)mod4=0)toconditionallysubtractsCfromD(setk=−1).Otherwise,CisalwaysaddedtoDinbothfields.ThedivisionpletedwhenC=0,andthefinalresultisavailableinW.MoredetailsabouttheoperationoftheUDMAarepresentedinthefollowingSection.
4.OverallHardwareSystemArchitecture
Figure1showslevelarchitectureoftheUnifiedModularDivider/Multiplier(callitUMDM)thatimplementstheUDMA.ThemainfunctionalblocksareRegisterFile(RF),Datapath,andControl.Moredetailsabouttheseblocksareprovidednext.
Op
Field
n
Control
333
n
input(
X,Y,P)
Load
2vectors2n
indstsrc1src2out1
RegisterFile(RF)
Y
(2vectors)
2nA
out2(2vectors)B
2n
Load
UMDMDatapath
Sum/Carry
(2vectors)2n
Figure1.Toplevelhardwarearchitectureoftheunifiedmodulardivider/multiplier(UMDM).
4.1.RegisterFile
Theregisterfilehasfiveregisters(R1toR5).SinceputationsaredoneinCarry-Saveform,eachintermediatevariable(
C,U,
D,W)isrepresentedintwovectors(sum,carry).So,theregistersinsidetheregisterfilearedesignedtostoretwon-bitvectors.Inotherwords,theithregisterRiisrepresentedasRi=(sum,carry)=(Ris,Ric).
Theregisterfilehasoneinput,andtwooutputports.TheControlblockprovidestheregisterfilewiththesignalsnecessarytoperformreading/writingoperations.The3-bitsignaldstdeterminesthedestinationregistertobewritten.Thesignalssrc1,src2(3-bitseach),specifytheregisterstobereadatoutputportsout1,out2,respectively.
4.2.UDMADatapath
Then-bitdatapathimplementingtheUDMAisshowninFigure2.Each“whileiteration”ofthealgorithmisimplementedin1clockcycleformultiplicationmode,and
in3clockcyclesfordivisionifCisodd,and2clockcyclesifCiseven,asexplainedlater.
As
Ac
Bs
Bc
LoadYshiftY
FSEL
a
b
c
NEGCin
UnifiedCarry-SaveAdder1withComplement(UCSA1)
Y
LSBitofU(u0)
complementer
Yshifter
(c0)
control
FSEL
AND
UnifiedCarry-SaveAdder2(UCSA2)
Cin
result_shiftershshift
Sum
Carry
Nsel_zero
Figure2.Unifieddatapathofthemodulardivider/multiplier(UMDMdatapath).
Theproposeddatapathhastwoinputsrepresentedincarry-saveformasA=(As,Ac)andB=(Bs,Bc)whichreceivetheirvaluesfromtheregisterfileports(out1,out2),respectively.Theponentsofthedatapatharetwo(3-2)UnifiedCarry-SaveAdders(UCSAs)whicharesimilarplexitytofull-adders[5].TheunifiedadderscanperformbitadditionwithorwithoutcarrydependingontheinputFSEL(FieldSelect).
Theunifiedaddermaybeusedtoimplementaredundantornon-redundantadder.Theuseofnon-redundantformoftheoperandsandresultsreducestheregisterareabutincreasestheadditiontime(becauseofcarrypropagation).WedecidedtouseCarrySaveadderstomaketheadditiontimeconstantandindependentoftheoperand’sprecision.
4.2.1.UnifiedCarry-SaveAdders:ThefirstadderinthedatapathisaUnifiedCarry-SaveAdderplement(UCSA1).Figure4.ashowsthebitslicediagramforthisadderandFigure4.bshowstheconnectionofnslicestoformann-bitadder.TheUCSA1outputsare:(sum,carry)=a+b+cwhenNEG=Cin=0,and(sum,carry)=a+b−cwhenNEG=Cin=
1.AdditionandsubtractioninGF(2n)arethesame.
NEGcbaFSEL
(a)bitslice
sumcarry
an-1-
1
FSEL
a1b1c1
FSEL
a0
FSEL
b0c0
NEG
UCSA11-bit
UCSA11-bit
carryn
sumn-1carryn-
1
sum1carry1
(b)n-bitadder
UCSA11-bit
Cinsum0carry0
Figure3.Unifiedcarry-saveadderplement(UCSA1)for1-bitandn-bitprecision.
AswecanseefromFigure3,thechangeofsignoperationinsidetheUCSA1isdoneinparallelwiththeadditionoperation,andso,itdoesnotaddtothecriticalpath.
TheANDgateshownatofUCSA2inFigure2isusedtoselectbetweenthevalue0orthemoduluspdependingwetherUisevenorodd(thisisdeterminedbytestingu0),respectively.TheinputsignalselzerowhenassertedforcesthethirdinputoftheUCSA2tozero.
ThedelayoftheUMDMdatapath(tdatapath)isdeterminedbythedelayofthetwoUCSAs,thedelayoftheANDgatebetweenthem,andthedelayoftheresultshifterwhichhasadelayof2-inputmultiplexer(tMUXtXOR).ByintegratingtheANDgatewiththesecondadder(shownindashedboxinFigure2),itsdelaywillnotaddtothepathdelay.KnowingthateachUCSAhasadelayofafulladder(tFA=2tXOR),weget:
tdatapath=tUSCA1+tUCSA2+tresultshifter=4tXOR+tMUX=5tXOR
TheYshiftershowninFigure2isashiftregisterusedtoimplementtheoperation(C>>1)onlyinthemultiplicationmode.ItisloadedwiththeinputY(multiplier)whenlaodY=
1.WhenshiftY=
1,Cisshiftedrightby1-bit(rememberthatC=Y).TheleastsignificantbitoftheshiftedCgoestothecontrolsectiontobetested(c0=0).
Thedatapathoutputs(Sum,Carry)areshiftedright1-bitbycorrectwiringusingtheresultshifterattheoutputoftheUCSA2.Thismoduleisimplementedastwo2-inputmultiplexerswithselectlineshift.Whenshift=1,theshiftedoutputsareselected.
4.3.ControlBlock
ThissectiondescribestheControlblockandthesystemoperationduringdivisionandmultiplicationmodesandthedesigntechniquesusedtoimplementtheproposedUMDM.
WhenLoad=1,theregistersinsidetheregisterfileareinitialized(bytheuser)withtheinputsX,Y,andpdependingontheoperationtobeperformedbythealgorithm.
Whilethealgorithmisindivisionmode,thetestC=0mustbedone.ThevectorCisinCarrySaveform.SimulationresultsshowthatC=4or−4attheendputation.Ifmoreiterationsthanrequiredareperformed,theresultisstillcorrect.IntheproposedUMDMdesign,thetestC=0isreplacedbytestingthebitvectorCforspecificvalues(4or-4),andnoneedpareallthebitsofCwithzero.
4.3.1.MultiplicationMode:Theproposedmultiplier/dividerperforms1iterationofthealgorithmineachclockcycleputingMontgomerymultiplicationinbothfields.Onlythreeregistersareused.Table1showstheloadingphaseformultiplicationanddivision(initializationoftheintermediatevariables)asdescribedbythealgorithm.
Table2showstheoperationoftheunifieddivider/mulitplierwhenperformingMontgomerymultiplicationinGF(p)orGF(2n).Itshowsthemaincontrolsignalsappliedtotheregisterfileandthedatapathputemultiplication.ThevaluesofthesignalswhicharenotmentionedintheTableareequaltozero.DependingonC(evenorodd),adifferentsetofsignalsareused.TheTablealsoshowstheinterpretationofthesesignalsonthedatapath,andthecorrespondingoperationintheUDMA.Whenc0=1,twoadditions(U:=(U+k∗W)andU:=(U+u0∗p)>>1)areperformedbythedatapath.
ThevariableCisshiftedrightonebitusingtheYshifterineverycycle(shiftY←1).Thecounterδisinitializedwithntoindicatethenumberofiterations.Sinceδisdecre-
Table1.Loadphaseformultiplicationanddivision
Multiplication
Division
Variables/parametersLoadingtheRFrepresentedbyRR←(Rs,Rc)
Variables/parametersLoadingtheRFrepresentedbyRR←(Rs,Rc)
Wandp
R1←(X,p)
W
R1←(X,0)
U
R2←(0,0)
U
R2←(0,0)
p
R3←(0,p)
D
R3←(p,0)
Variable
LoadingtheShifter
C
R4←(Y,0)
C
YShifter←(Y)
p
R5←(0,p)
Table2.TheoperationoftheUMDMduringMontgomerymultiplication
ControlSignals
c0=0(then)DatapathInterpretation
ControlSignals
c0=1(else)DatapathInterpretation
src1←(010)src2←(011)dst←(010)shiftY←1shift←1decdelta←
1
(As,Ac)←(Us,Uc)(Bs,Bc)←(0,p)storetheresultinUC>>1theresult>>1δ=δ−
1
src1←(010)src2←(001)dst←(010)shiftY←1shift←1decdelta←
1
(As,Ac)←(Us,Uc)(Bs,Bc)←(X,p)storetheresultinUC>>1theresult>>1δ=δ−
1
Computes:
U:=(U+u0∗p)>>1U:=(U+k∗W),U:=(U+u0∗p)>>
1
mentedbyoneineachiteration(decdelta=1),putationendswhenδ=0,andtheresultcanbereadformR2.
4.3.2.DivisionMode:EachmodulardivisioniterationintheproposedUMDMarchitecturerequires2clockcyclesifCisevenand3clockcyclesifCisoddinbothfields.TheinitializationofvariableswasshowninTable1.
LetusassumethatthealgorithmputingmodulardivisioninGF(p).Tables3and4showthecontrolsignalsduringdivisionwhenCisevenandodd,respectively.Table3showswhenδispositive,theexpressionU:=(U+u0∗p)>>1putedinthefirstclockcycle.Cisshiftedrightby1-bit(C>>1)inthesecondcycleinordertoenablethetestforc0=0whichdecidesonthecontrolsignalsofthenextiteration.
Table3.TheUMDMoperationputingdivisionwhenc0=
0
c0=0(then)
State1:Original
State2:Swapped
ClockControl
Datapath
Control
Datapath
CycleSignals
Interpretation
Signals
Interpretation
cycle1cycle2
src1←(001)src2←(101)dst←(001)shift←1src1←(100)selzero←1dst←(100)shift←1decdelta←
1
(As,Ac)←(Us,Uc)(Bs,Bc)←(0,p)storeresultinR1(U)result>>
1
U:=(U+u0∗p)(As,Ac)←(Cs,Cc)(Bs,Bc)←(0,0)storeresultinR4(C)result>>1δ=δ−
1
C:=(C)>>
src1←(010)(As,Ac)←
src2←(101)(Bs,Bc)←
dst←(010)storeresult
shift←
1
result>>
>>1puted
src1←(011)selzero←1dst←(011)shift←1decdelta←11puted
(As,Ac)←(Bs,Bc)←storeresultresult>>δ=δ−
1
(Ws,Wc)(0,p)inR2(W)1(Ds,Dc)(0,0)inR3(D)
1
ClockCyclecycle1cycle2cycle3
Table4.TheUMDMoperationduringdivisionwhenc0=
1
c0=1(else)
State1:Original
State2:Swapped
Control
Datapath
Control
Datapath
Signals
Interpretation
Signals
Interpretation
src1
←(001)(As,Ac)←(Us,Uc)src1←(010)(As,Ac)←(Ws,Wc)
src2←(010)(Bs,Bc)←(Ws,Wc)src2←(001)(Bs,Bc)←(Us,Uc)
dst←(001)storeresultinR1(U)dst←(010)storeresultinR2(W)
shift←
0
resultisnotshiftedshift←
0
resultisnotshifted
decdelta←∗,*=1whenTEST=0,otherwise*=
0.(δ=δ−1)
U:=(U+k∗W)puted
src1←(001)(As,Ac)←(Us,Uc)src1←(010)(As,Ac)←(Ws,Wc)
src2←(101)(Bs,Bc)←(0,p)
src2←(101)(Bs,Bc)←(0,p)
dst←(001)storeresultinR1(U)dst←(010)storeresultinR2(W)
shift←
1
result>>
1
shift←
1
result>>
1
negatedelta←∗,*=1whenδ<0,otherwise*=
0.(δ=−δ)
U:=(U+u0∗p)>>1puted
src1←(100)(As,Ac)←(Cs,Cc)src1←(011)(As,Ac)←(Ds,Dc)
src2←(011)(Bs,Bc)←(Ds,Dc)src2←(100)(Bs,Bc)←(Cs,Cc)
dst←(100)storeresultinR4(C)dst←(011)storeresultinR3(D)
shift←
1
result>>
1
shift←
1
result>>
1
C:=(C+k∗D)>>1puted
TheControlduringdivisionhastwomajor”states”:Original(notswapped)andSwapped.Thecontrolgoesbackandforthbetweenthesestatesonlywhenc0=1andδ<
0.Thesignofδismadepositive(δ=−δ)whenthisconditionhappens(realizedbythecontrolsignalnegatedeltaasshowninTable4).WhenthecontrolstateisOriginal,thevariablesW,
U,D,CarereadfromtheregistersR1,R2,R3,R4,respectively.Also,theupdatedvaluesofthesevariablesarewrittentotheaboveregistersinthesameorder.WhenthecontrolstateisSwapped,thesourcesforthevariablesWandUarereversed(WisreadfromR2andUfromR1).SamethingappliesforCandD(CisreadfromR3andDfromR4).Thesameapproachisusedforwritingtothesevariables.Thisway,theswapoperationisrealizedbyreadingfrom/writingtothecorrectregister.
Thevalueofk∈{-1,1}isdeterminedbythecontrolsectiondependingontheresultofthetest((C+D)mod4=0ANDField=GF(p)),whichisdenotedby(TEST)inTable4.Inthecasek=−1,negativeDandnegativeWareobtainedbysettingN=
1.SincetheTESTisdoneontheleasttwosignificantbitsofCandD,itcanbeimplementedusingatwo-levelwork.
IfthealgorithmputingthemodulardivisioninGF(2n),thesameproceduredescribedaboveisfollowed,exceptthattheTESTisnotapplicable(Field=GF(2n)).Forbothfields,putationpletedwhenC=0,andtheresultisZ=
W.
5.ExperimentalResultsandComparisons
Thissectionhastwocategoriesofexperimentalresults:(a)numberofiterationsobtainedfromaMaplemodeland(b)thecriticalpathdelayresultsobtainedbysynthesisoftheVHDLdescriptionofthealgorithm.
5.1.TheNumberofIterations
Maplewasusedtodescribetheproposedalgorithm(Alg1=UDMA)andtheunifiedMontgomeryinversealgorithmpresentedin[6](Alg2).Atleast100randomsampleswereusedtoverifyeachalgorithmoperationandobtainstatistics.
Theotherdivision(inversion)algorithmparedistheunifiedMontgomeryinversealgorithmpresentedin[6](Alg2).Forconsistency,nomultiple-wordcalculationisconsideredhere.Forann−bitinputY,putesZ=Y−12k,wheren≤k≤2nisthenumberofalgorithmiterations.AcorrectionstepisneededtogettheinverseintheMontgomerydomain(Y−12n)orintheintegerdomain(Y−1).Therefore,thetotalnumberofiterationsrequiredputetheinverseinMontgomerydomainis2k−n.putethemodularinverseintheintegerdomainitneeds2kiterations.
parisonswereusedinAlg2parethesizeofthebitvectorsthatrepresentelementsinthefield(thesamewayitwasdonein[1,8,9])insteadofthecounter(δ)usedinouralgorithm.parisonlimitsafasthardwareimplementation.
Figure4showsthenumberofiterationsasafunctionofoperandsizerequiredbyAlg1(UDMA)andAlg2putetheintegermodularinverse(division)inGF(p)andGF(2n).Thesizeoftheoperandsisintherange(160to512)bits.Fromfigure4,wecanseethatAlg1executesinabout25%lessiterationsthanAlg2putingtheinverseintheintegerdomainforGF(p).ForGF(2n),Alg1hasabout40%lessiterationsthanAlg2putingtheinverseintheintegerdomain.NoticethatthenumberofiterationsforAlg1inbothfieldsincreaselinearlywiththeoperandprecision.
Iterations
18001600140012001000
800600400200
00
100
200
300
400
Operandsize(bits)
Alg1GF(p)Alg2GF(p)Alg1GF(2^n)Alg2GF(2^n)
500
600
Figure4.ThenumberofiterationsasafunctionofoperandsizerequiredbyAlg1(UDMA)andAlg2(presentedin[6])putethemodularinverseinGF(p)andGF(2n)
J.Goodmanetal.presentedin[8]apublic-keyprocessorthatimplementsoperationsrequiredforEllipticCurveCryptography(ECC)includingmodularinverseinGF(2n)only.Itisstatedin[8]thattheinversioninGF(2n)takesonaverage3.3cyclesforeachbit.Alg1needsamaximumof2iterations/bitandonaverage1.5iterations/bitputethemodularinverseinGF(2n).TheUMDMarchitectureperformseachiterationofAlg1in2.5clockcyclesonaverage.Therefore,theGF(2n)inversionbyAlg1takesonaverage1.5×2.5=3.75cyclesforeachbit.Butontheotherhand,thecriticalpathdelay(clockperiod)ofthedatapathoftheprocessorpresentedin[8](callittGµP)isthedelayof2fulladders(tFA)and2levelsofmuxesandanANDgate
(tGµP=2tFA+2tmux+tAND),paredtoacriticalpathdelayofonly2fulladdersand1levelofmuxesfortheUMDMdatapathproposedinthiswork(tUMDM=2tFA+tmux).Thecriticalpathdelaywillaffecttheputationaltimeaswillbeseenlater.
putesMontgomerymodularmultiplicationinncycles,whichisconsistentwithseveralotherdesigns[5,8,11].
5.2.SynthesisResults
TheexperimentaldatapresentedinthissectionweregeneratedusingMentorGraphicsCADtools.ThetargettechnologywassettoAMI05fastauto(0.5µmCMOSwithhierarchypreserved)providedintheASICDesignKitfromthepany[3].
TheUMDMarchitecturewasdescribedinVHDLandsimulatedinModelSimforfunctionalcorrectness.ItwassynthesizedusingLeonardotoolforthementionedtechnology.
Figure5showsthecriticalpathdelays(innano-seconds)oftheUMDMfortheprecisionrange(160-512)bits.TheFigureshowsthatthedelayincreaseswiththenumberofbitsanditsaturatesathigherprecision.Thisindicatesthatthecriticalpathdelay(clockperiod)oftheUMDMbecameindependentoftheoperandssizeathighprecisions.
13
Time(nanoseconds)
120
100
200
300
400
500
600
Operandsize(bits)
Figure5.ThecriticalpathdelayoftheUMDMinnano-seconds(operandsizefrom160-512bits).
Thepublic-keyprocessorpresentedin[8](GµP)runsatclockrateof50MHz(clockperiod=20nsec),anditisconsideredagoodrepresentativeofthisclassofhardwaredesigns.Thedivider/multiplierproposedinthisworkhasamaximumclockperiodof12.8nsecat512-bitoperandsize,whichisabout1.6timesfasterthanGµ
P.Also,asmentionedbefore,theGµPtakes3.3cycles/bittoperformamodularinverseinGF(2n)andthisdesigntakes3.75cycles/bit.LetthetotalputationtimeofagivendesignTdesign:
Tdesign=(cycles/bit)∗n∗clockperiod
wherenistheoperandsizeinbits.Atn=512-bits,theputationtimeof
GµP(TGµP)is:TGµP=3.3∗512∗20×10−9=33.79µsecandtheputationtimeforthisdesign(TUMDM)is:TUMDM=3.75∗512∗12.8×10−9=24.57µsec.By
comparingTGµPandTUMDM,wefindthattheUMDMproposedinthisputes
division
in
33.79−24.5733.79
=
27.3%
less
total
computation
time
than
the
Gµ
P.
Ontheotherside,theareafortheUMDMdesignwasextractedfromtheexperimental
dataas:AUMDM=236.12∗n+180=O(n)gates.
Thearchitecturepresentedin[4]isdedicatedtoGF(2n)only,anditusesparisonsofthefieldpolynomials.ThedesignhasanplexityofO(n)too.
6.Conclusion
InthisworkwepresentaUnifiedmodularDivision/MultiplicationAlgorithm(UDMA)anditscorrespondingarchitecture.ThealgorithmisabasedontheBinaryGCDalgorithmformodulardivisioninbothfields,andontheMontgomerymodularmultiplicationalgorithm.TheputesthedivisionineitherGF(p)orGF(2n)fieldsinanefficientwayparedwithotheralgorithms.Itusesacountertokeeptrackofthedifferencebetweenthevaluesofelementsinthefield,reducingplexityofeachiterationandmakingitfasterthantheiterationofotheralgorithmsthatuseparisons.paredwithMontgomerymultipliers,theproposedalgorithmhasthesamenumberofiterationsplexity.
TheUnifiedDivider/Multipliermodulethatimplementsthealgorithmefficientlysupportsallitsoperationsandusescarry-saveunifiedaddersforreducedcriticalpathdelay.
Experimentalresultsshowthatputationtimeoftheproposedsolutionpetitivewithotherdedicated(limited)solutionsandthecostinareapaidtohavetheintegrationofdivisionandmultiplicationisnothigh.Multiplicationisdonealmostasfastasinadedicatedmultiplierwiththeaddedfunctionalityofdivisionandtheflexibilitytochoosetherequiredfinitefield.
AcknowledgementsThisworkwassupportedbytheNSFCAREERgrantCCR0093434andbytheJordanUniversityofScienceandTechnology(JUST).
References
[1]
A.A.Gutub,
A.F.Tenca,
E.SavasandC.K.Koc.Scalableandunifiedhardwareputemontgomeryinverseingf(p)andgf(2n).InProc.CHES2002,pages484–499.
[2]
A.F.TencaandL.A.Tawalbeh.AnAlgorithmforUnifiedModularDivisioninGF(p)andGF(2n)SuitableforCryptographicHardware.IEEElectronicsLetters,40
(5):304–306,March2004.
[3]ASICDesignKit.MentorGraphicsCo./partners/hep/AsicDesignKit/dsheet/ami05databook.html.
[4]
A.D.DaneshbehandM.A.Hasan.AUnidirectionalBitSerialArchitectureforDouble-BasesDivisionoverGF(2m).InIEEE16thSymposiumonComputerArithmetic.
[5]
E.Savas,
A.F.TencaandC¸.K.Ko¸c.AscalableandunifiedmultiplierarchitectureforfinitefieldsGF(p)andGF(2m).InProc.CHES2000,pages281–296.
[6]
E.SavasandC.K.Koc.Architecturesforunifiedfieldinversionwithapplicationsinellipticcurvecryptography.InThe9thIEEEinternationalconferenceonElectronic,Circuitsandsystems,2002.
[7]
T.ElGamal.Apublickeycryptosystemandsignitureschemebasedondiscretelogarithms.IEEETrans.-InformationTheory,IT-31
(4):469–472,July1998.
[8]
J.GoodmanandA.P.Chandrakasan.Anenergy-efficientreconfigurablepublic-keycryptographyprocessor.IEEEJournalofsolid-statecircuits,36(11):1808–1820,November2001.
[9]
J.Wolkerstorfer.Dual-fieldarithmeticunitforgf(p)andgf(2n).InCHES2002,pages500–514.[10]
D.Knuth.TheArtofComputerProgramming,Volume2,SeminumericalAlgorithms.3rdedition,
1998.[11]
P.L.Montgomery.Modularmultiplicationwithouttrialdivision.MathematicsofComputation,
44(170):519–521,April1985.[12]
N.Takagi.AVLSIAlgorithmforModularDivisionBasedontheBinaryGCDalgorithm.IEICE
Trans.fundamentals,E81-A
(5):724–728,May1998.