Mfakto

From Prime-Wiki

mfakto (mersenne number faktoring with openCL) is a GPU-based Mersenne number factoring program. It is a port of mfaktc (mersenne number faktoring with CUDA) for use on ATI/AMD GPUs rather than NVIDIA GPUs.

History

  • First announced 07-Jun-2011 by Bdot.
  • First public release (version 0.07) 15-Aug-2011
  • Current version 0.14

Behavior

mfakto can use both CPU and GPU resources to perform factoring. When started without any option, it will select a suitable GPU, read program settings from a file called mfakto.ini, run a short self-test and then read a factoring task from a work definition file (default: worktodo.txt).

Since version 0.13, the sieving can be moved to the GPU as well. This results in very little CPU usage (only for driving the GPU).

mfakto accepts the following command line options:

-h|--help              display a short option overview and exit
-d <xy>                use OpenCL platform number x and device number y
-d c                   force using all CPUs
-d g                   force using the first GPU
-v <n>                 verbosity level: 0=terse, 1=normal, 2=verbose, 3=debug
-tf <exp> <min> <max>  trial factor M<exp> from 2^<min> to 2^<max> and exit instead of parsing the worktodo file
-i|--inifile <file>    load <file> as inifile (default: mfakto.ini)
-st                    run built-in selftest (about 1,500 testcases) and exit
-st2                   run built-in selftest (all 33,000 testcases) and exit
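
As an illustration of how these options combine, the following Python sketch builds an argument list for a one-off trial factoring run. Only the flags (-v, -i, -tf) come from the table above; the helper name build_mfakto_args, the executable path ./mfakto, the example exponent and bit range, and the sanity checks are assumptions made for this example.

```python
def build_mfakto_args(exponent, bit_min, bit_max, verbosity=1,
                      inifile=None, executable="./mfakto"):
    """Build an argv list for `mfakto -tf <exp> <min> <max>` (hypothetical helper)."""
    # Sanity checks are assumptions of this sketch, not documented mfakto behavior.
    if not 0 <= verbosity <= 3:
        raise ValueError("verbosity must be in the range 0..3")
    if bit_min >= bit_max:
        raise ValueError("bit_min must be smaller than bit_max")
    args = [executable, "-v", str(verbosity)]
    if inifile is not None:
        args += ["-i", inifile]
    # -tf makes mfakto factor one exponent and exit instead of reading worktodo.txt
    args += ["-tf", str(exponent), str(bit_min), str(bit_max)]
    return args


args = build_mfakto_args(57885161, 70, 71, verbosity=2)
print(" ".join(args))
# To launch it (assumes an mfakto binary exists at the given path):
#   import subprocess; subprocess.run(args, check=True)
```

Building the list first and passing it to a process launcher avoids shell quoting issues when the inifile path contains spaces.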

mfakto understands these options for debugging purposes:

--timertest            run test of timer functions and exit
--sleeptest            run test of sleep functions and exit
--perftest             run performance test of the sieve and other parts, then exit
--CLtest               run test of some OpenCL functions and exit. Specify -d before --CLtest to test the specified device

mfakto's operation can be configured using these settings in the inifile (mfakto.ini):

  • Verbosity: same as the -v command line switch. -v overrides Verbosity
 Minimum: 0  (terse)
 Maximum: 3  (debug)
 Default: 1  (normal)
  • SievePrimes: Defines how far the factor candidates (FCs) are pre-sieved on the CPU. The first <SievePrimes> odd primes are used to sieve the FCs. (relevant for CPU sieving only)
 Minimum: SievePrimes=5000
 Maximum: SievePrimes=200000
 Default: SievePrimes=25000
  • SievePrimesAdjust: Set this to 1 to enable automatic adjustment of SievePrimes during runtime based on the "average wait times". 0 uses the fixed value above. (relevant for CPU sieving only)
 Default: SievePrimesAdjust=1
  • SievePrimesMin: If SievePrimesAdjust=1, then SievePrimesMin defines the lower limit for SievePrimes. Lower values may reduce CPU consumption. Higher values enforce a higher TF efficiency at the cost of more CPU resources. Lowering this value below 5000 is not advised, as that dramatically reduces the quality of the siever output, leading to more futile work for the GPU. (relevant for CPU sieving only)
 Minimum: SievePrimesMin=256
 Maximum: SievePrimesMin=999999
 Default: SievePrimesMin=5000
  • SievePrimesMax: If SievePrimesAdjust=1, then SievePrimesMax defines the upper limit for SievePrimes. Lower values reduce main memory consumption (8k per 1k SievePrimesMax). Higher values allow for deeper sieving if enough CPU resources are available. (relevant for CPU sieving only)
 Minimum: SievePrimesMax=5000
 Maximum: SievePrimesMax=1000000
 Default: SievePrimesMax=200000
  • SieveSizeLimit: Defines the block size of the sieve in kiB. For small SievePrimes values, the size of the L1 cache is a good choice; starting at SievePrimes ~ 100000, bigger values are more efficient. Use mfakto --perftest to test your CPU for the best values. Intel CPUs often have 32 kiB L1 cache, AMD 64 kiB, Bulldozer 16 kiB. This value is used only if SIEVE_SIZE_LIMIT was not set to a fixed value at compile time. (relevant for CPU sieving only)
 Minimum: SieveSizeLimit=12
 Maximum: SieveSizeLimit=2000   (not enforced, but maximum useful value)
 Default: SieveSizeLimit=64
  • NumStreams: Set the number of data sets used by mfakto. NumStreams must be >= 1. In this case mfakto can process one data set on the GPU while the CPU preprocesses the next one. When NumStreams is >= 2, the time needed to upload the data sets (CPU->GPU transfer) can be hidden (if the hardware supports it; OpenCL does not yet support this). A greater number increases the memory consumed by mfakto (host and GPU memory). The current limit for the number of streams is 10. (relevant for CPU sieving only)
 Minimum: NumStreams=1
 Maximum: NumStreams=10
 Default: NumStreams=3
  • VectorSize: Set the number of factor candidates that a single GPU-thread will work on in parallel. This increases the execution unit utilization but requires more registers. When more space is needed than is available in registers, (slow) scratchpad memory (global memory) will be used. On most hardware this starts happening at vector size 8. On a HD6870, a vector size of 4 is fastest except for the barrett92 and barrett72 kernels, which run slightly faster with vector size 8. On GCN cards (77xx, 78xx, 79xx), however, VectorSize 2 is fastest.
 Allowed sizes are 1, 2, 4, 8, 16.
 Default: VectorSize=4
  • GridSize: GridSize determines the number of threads per grid. The automatic threads-per-grid parameter, which also depends on the number of multiprocessors of your GPU, is set to:
  GridSize = 0: threads per grid =  131072
  GridSize = 1: threads per grid =  262144
  GridSize = 2: threads per grid =  524288
  GridSize = 3: threads per grid = 1048576
  GridSize = 4: threads per grid = 2097152

A smaller GridSize has more overhead than a bigger GridSize for long running jobs. For really small jobs there can be a small benefit on computation time if the GridSize is small. A smaller GridSize directly reduces the runtime per kernel launch and might result in a better interactivity and may be required for a stable UI on slower GPUs. (relevant for CPU sieving only)

 Minimum: GridSize=0
 Maximum: GridSize=4
 Default: GridSize=4
  • WorkFile: the name of the file which contains the factoring assignments.
 Default: WorkFile=worktodo.txt
  • ResultsFile: the name of the file to which the factoring results will be written.
 Default: ResultsFile=results.txt
  • Checkpoints: Checkpoints are needed for resume capability. After a class is finished a checkpoint file can be written. When mfakto is interrupted during the run (by pressing ^C) and restarted later it will begin at the last processed class.
 Checkpoints=0: disable checkpoints
 Checkpoints=1: enable checkpoints
 Checkpoints=n, n>1: write a checkpoint after n classes have been tested
 Checkpoints=961 (or bigger): never write a checkpoint except when ^C is used to abort mfakto.
 
 Default: Checkpoints=1
  • CheckpointDelay: The minimum time in seconds between two checkpoint writes.
 Minimum: CheckpointDelay=0   (write a checkpoint after each class)
 Maximum: CheckpointDelay=3600
 Default: CheckpointDelay=300
  • Stages: Allows splitting an assignment into multiple bit ranges. Enabling Stages only makes sense when StopAfterFactor is 1 or 2.
 0 = disabled
 1 = enabled 
 Default: Stages=1
  • StopAfterFactor: Defines what happens when a factor is found.
 0: Do not stop the current assignment after a factor was found.
 1: When a factor was found for the current assignment stop after the
    current bitlevel. This only makes sense when Stages is enabled.
 2: When a factor was found for the current assignment stop after the
    current class.

 Default: StopAfterFactor=1
  • PrintMode: Defines the screen output format.
 0: print a new line for each finished class
 1: overwrite the current line (more compact output)

 Default: PrintMode=0
  • V5UserID and ComputerID: if specified, then the result lines in the results file will have the prefix "UID: user/host, " - the same way as prime95 does it.
 Default: none (unset)
  • TimeStampInResults: Configures whether each result line should be preceded by a date-and-time stamp (similar to prime95)
 Default: TimeStampInResults=0
  • ProgressHeader: here you can specify a fixed string that is displayed as a header to the progress line without any modification.
 Default: ProgressHeader=   done |    ETA |     GHz |time/class|    #FCs | avg. rate | SieveP. |CPU idle
  • ProgressFormat: allows to customize the progress output of mfakto. You can use any combination of the following format specifications, which will be replaced correspondingly in the progress line:
 %C - class ID (n/4620)            "%4d"
 %c - class counter (n/960)        "%3d"
 %p - percent complete (%)         "%6.2f"
 %g - GHz-days/day (GHz)           "%7.2f"
 %t - time per class (s)           "%6.0f"/"%6.1f"/"%6.2f"/"%6.3f"
 %e - eta (d/h/m/s)                "%2dm%02ds"/"%2dh%02dm"/"%2dd%02dh"
 %n - number of candidates (M/G)   "%6.2fM"/"%6.2fG"    for CPU sieving, this is the number of sieved candidates passed to the GPU, otherwise the raw (unsieved) number of candidates
 %r - rate (M/s)                   "%6.2f"   again, number of sieved / raw candidates per second
 %s - (GPU-)SievePrimes            "%7d"
 %w - CPU wait time for GPU (us)   "%6lld"
 %W - CPU wait % (%)               "%6.2f"
 %d - date (Mon nn)                "%b %d"  (strftime format)
 %T - time (HH:MM)                 "%H:%M"  (strftime format)
 %U - username (as configured)     "%15s"   no fixed width, but truncated to 15 chars
 %H - hostname (as configured)     "%15s"   no fixed width, but truncated to 15 chars
 %M - the exponent being worked on "%d"     no fixed width
 %l - the lower bit-limit          "%2d"
 %u - the upper bit-limit          "%2d"
 Default: ProgressFormat=%p% | %e | %g |  %ts | %n | %rM/s | %s | %W%
  • AllowSleep: allow the CPU to sleep if nothing can be preprocessed? Not currently used by mfakto, it is always waiting for an event to signal GPU thread completion.
 0: Do not sleep if the CPU must wait for the GPU
 1: The CPU can sleep for a short time if it has to wait for the GPU
 Default: AllowSleep=1
  • GPUType: Different GPUs may have their best performance with different kernels. Here, you can give a hint how to optimize the kernels.
 Possible values:
 GPUType=AUTO             try to auto-detect, if that does not work: let me know
 GPUType=GCN              Tahiti et al. (HD77xx-HD79xx)
 GPUType=VLIW4            Cayman (HD69xx)
 GPUType=VLIW5            most other AMD GPUs (HD4xxx, HD5xxx, HD62xx-HD68xx)
 GPUType=APU              all APUs (C-30 - C-60, E-240 - E-450, A2-3200 - A8-3870K) not sure if the "small" APUs would work better as VLIW5.
 GPUType=CPU              all CPUs (when GPU not found, or forced to CPU), also used for unknown devices.
 GPUType=NVIDIA           reserved for Nvidia-OpenCL. Currently mapped to "CPU" and not yet functional on Nvidia Hardware.
 GPUType=INTEL            reserved for Intel-OpenCL (e.g. HD4000). Not yet functional.
 Default: GPUType=AUTO 
  • SieveCPUMask: Sets the CPU affinity of the siever thread. It is a bit-mask, specified in decimal: CPU0=1, CPU1=2, CPU2=4, CPU3=8 ... e.g. 0=no limit, 5=CPU0+CPU2, 15=CPU0-3, 18446744073709551615=CPU0-63 (max value) (relevant for CPU sieving only)
  • SieveOnGPU: move the sieving to the GPU. This will free most of the CPU resources
 0: Sieve on the CPU
 1: Sieve on the GPU
  • GPUSievePrimes: defines how far we sieve the factor candidates on the GPU. The first <GPUSievePrimes> primes are sieved. This parameter also influences the shared mem requirements (lower sieving requires more shared memory)
 Minimum: GPUSievePrimes=54
 Maximum: GPUSievePrimes=1075000
 Default: GPUSievePrimes=111157
  • GPUSieveSize: defines how big a GPU sieve we use (in M bits). Higher is usually faster, but the screen may lag more easily.
 Minimum: GPUSieveSize=4
 Maximum: GPUSieveSize=128
 Default: GPUSieveSize=96
  • GPUSieveProcessSize: defines how many bits of the sieve each TF block processes (in K bits). Larger values may lead to fewer wasted cycles by reducing the number of times all threads in a warp are not TFing a candidate. However, more shared memory is used, which may reduce occupancy. Smaller values should lead to a more responsive system (each kernel takes less time to execute). GPUSieveProcessSize must be a multiple of 8.
 Minimum: GPUSieveProcessSize=8
 Maximum: GPUSieveProcessSize=32  (requires GPUSievePrimes > 310)
 Default: GPUSieveProcessSize=24
  • MoreClasses: is a switch for defining if 420 (2*2*3*5*7) or 4620 (2*2*3*5*7*11) classes of factor candidates should be used. Normally, 4620 gives better results, but for very small classes 420 reduces the class initialization overhead enough to provide an overall benefit. Used only when sieving on the GPU; the CPU-sieve will always use 4620 classes.
 MoreClasses=0 (use 420 classes)
 MoreClasses=1 (use 4620 classes)
 Default: MoreClasses=1
  • FlushInterval: Newer AMD drivers cause high CPU load when many kernels are scheduled on the GPU. To avoid useless CPU cycles (busy wait), <FlushInterval> kernels will be scheduled at most, then mfakto will wait for the completion of half of them until the next kernels are scheduled. Higher values will start consuming CPU, lower values may not put full load on the GPU. Lowering FlushInterval can also help reduce screen lag. Only use when sieving on the GPU.
 FlushInterval=0 means disable this feature (schedule all kernels as fast as possible).
 Default: FlushInterval=8.
  • UseBinfile: Mfakto can save the compiled kernels avoiding the need to recompile the kernels at every start. This setting defines the file name to use. If not set, mfakto will always recompile the kernels. Note that mfakto does not detect when the kernel files have changed (disable this setting or delete the cache file to force recompilation). The file consists of one line containing the build options, after that comes the compiled kernel as delivered by the driver (intended to be binary, but NV issues assembly). The AMD binary comes in an ELF format.
 Default: UseBinfile=mfakto_Kernels.elf
 if empty, always recompile
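
The documented limits above can be checked mechanically before starting a long run. The following Python sketch is one way to do that, not part of mfakto: it assumes the inifile consists of plain Key=Value lines and that lines starting with '#' are comments, and the names RANGES, parse_ini and check_settings are hypothetical. The ranges themselves are copied from the list above.

```python
# Documented (minimum, maximum) values, copied from the settings list.
RANGES = {
    "Verbosity": (0, 3),
    "SievePrimes": (5000, 200000),
    "SievePrimesMin": (256, 999999),
    "SievePrimesMax": (5000, 1000000),
    "NumStreams": (1, 10),
    "GridSize": (0, 4),
    "CheckpointDelay": (0, 3600),
    "GPUSievePrimes": (54, 1075000),
    "GPUSieveSize": (4, 128),
    "GPUSieveProcessSize": (8, 32),
}


def parse_ini(text):
    """Parse Key=Value lines; full-line '#' comments are an assumption here."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_settings(settings):
    """Return a list of problems found (empty if none)."""
    problems = []
    for key, (low, high) in RANGES.items():
        if key in settings and not low <= int(settings[key]) <= high:
            problems.append(f"{key}={settings[key]} is outside {low}..{high}")
    if "VectorSize" in settings and int(settings["VectorSize"]) not in (1, 2, 4, 8, 16):
        problems.append("VectorSize must be one of 1, 2, 4, 8, 16")
    if int(settings.get("GPUSieveProcessSize", 8)) % 8 != 0:
        problems.append("GPUSieveProcessSize must be a multiple of 8")
    return problems


example = """
# deliberately broken example settings
SieveOnGPU=1
GPUSievePrimes=111157
GPUSieveSize=96
GPUSieveProcessSize=20
VectorSize=3
"""
for problem in check_settings(parse_ini(example)):
    print(problem)
```

The sketch only covers settings with a stated numeric range or value list; string settings such as ProgressFormat are passed through unchecked.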

Options for --perftest

  • TestSieveSizes: a list of different SieveSizes to be tested with the CPU sieve. Comma-separated list (no spaces) of multipliers of ~12kB (the internal sieve chunk size of 13*17*19*23 bits). 30 values at most.
 no default (= skip the CPU sieve test)
  • TestSievePrimes: for each of the above SieveSizes, which SievePrimes values should be tested. Comma-separated list (no spaces) of SievePrimes values. The SievePrimes values must be in the range 256..1000000. Up to 8 of those fit on an 80-character line. Used for both the CPU and the GPU sieve test. 30 values at most.
 no default (= skip CPU and GPU sieve test)
  • TestGPUSieveSizes: the same for the GPU sieve, given in M bits, in the range 4..128. Will be used together with the above TestSievePrimes to test the GPU sieve. 30 values at most.
 no default (= skip the GPU sieve test)
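
To make the "~12kB" chunk size concrete, the short Python calculation below converts TestSieveSizes multipliers into approximate sieve sizes. It assumes, as the description suggests, that the tested sieve size is simply the multiplier times the 13*17*19*23-bit chunk; the function name sieve_size_bytes and the sample multipliers are illustrative only.

```python
# Internal sieve chunk size as stated above: 13*17*19*23 bits.
CHUNK_BITS = 13 * 17 * 19 * 23  # 96577 bits


def sieve_size_bytes(multiplier):
    """Approximate sieve size in bytes for one TestSieveSizes multiplier."""
    return multiplier * CHUNK_BITS / 8


for multiplier in (1, 3, 5):
    kib = sieve_size_bytes(multiplier) / 1024
    print(f"multiplier {multiplier}: {multiplier * CHUNK_BITS} bits, about {kib:.1f} KiB")
```

Under this assumption a multiplier of 1 corresponds to 12072.125 bytes, which is where the "~12kB" figure comes from.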

Options for debugging purposes

  • OCLCompileOptions: Overrides the default compile options for the kernels
 default: not set (use options according to other settings, like -I. -DVECTOR_SIZE=%d -DMORE_CLASSES)
 possible values (example): -g / -O3, -DCHECKS_MODBASECASE, -DCL_GPU_SIEVE, -DSMALL_EXP
 Other useful settings: -save-temps for keeping .il and .isa files

Hardware requirements

A device supporting OpenCL version 1.1 (or newer), currently AMD/ATI cards of the series

  • R9, R7 xxx
  • HD7xxx
  • HD6xxx (including the GPU part of AMD's APUs)
  • HD5xxx
  • HD4xxx
  • FireStream 92xx

Software requirements

  • Windows 64-bit
  • Windows 32-bit
  • Linux 64-bit

Additional required software:

Major differences to mfaktc

  • In order to improve utilization of the AMD GPUs, mfakto can process multiple factor candidates in one GPU-thread. This is accomplished by using OpenCL vectors. The preferred vector size is 4, although on newer AMD GPUs (GCN and some APUs) a vector size of 2 performs best.
  • There is no 95-bit kernel yet, therefore mfakto supports factoring only up to factor sizes of 92 bits.
  • Mfakto only supports splitting the factor candidates into 4620 classes when sieving on the CPU (mfaktc can be compiled for 420 classes). This allows for using a bigger grid size of up to 2097152 factor candidates per GPU kernel call. When sieving on the GPU, the ini file option MoreClasses=0 will switch to 420 classes.
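
The class counts of 420 and 4620 relate to the "n/960" class counter shown in the progress line. The Python sketch below counts how many classes survive under the filter usually described for mfaktc-style trial factoring: a candidate q = 2*k*p + 1 is kept only if q is 1 or 7 modulo 8 and is not divisible by any of the small primes contained in the class count. That filter is an assumption here, not taken from the mfakto source, and the function name surviving_classes is hypothetical. The exponent 57885161 is used only as a sample prime exponent.

```python
def surviving_classes(exponent, num_classes=4620):
    """Count classes k (mod num_classes) whose candidates q = 2*k*exponent + 1
    pass the assumed filter: q = 1 or 7 (mod 8), no small prime factor."""
    # Small primes that divide the class count (3, 5, 7 for 420; plus 11 for 4620)
    small_primes = [p for p in (3, 5, 7, 11) if num_classes % p == 0]
    count = 0
    for k in range(num_classes):
        q = 2 * k * exponent + 1
        if q % 8 not in (1, 7):
            continue  # factors of Mersenne numbers must be +-1 mod 8
        if any(q % p == 0 for p in small_primes):
            continue  # every candidate in this class is divisible by p
        count += 1
    return count


print(surviving_classes(57885161, 4620))  # 960
print(surviving_classes(57885161, 420))   # 96
```

With 4620 classes, 960 of them (about 20.8%) remain to be tested, compared with 96 of 420 (about 22.9%), which is consistent with the statement that 4620 classes normally give better results.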

License

The program is released under the GNU/GPLv3 license.

Resources

Source code:

Official website: (None)

Current version (0.14, 2014-04-17) on MersenneForum:

Contacting the author: PM on MersenneForum to "Bdot" is preferred.

See also