WPE-Notes

Slide 1

  • Modern applications rely heavily on data, creating demand for faster processing of massive amounts of data

  • Memory performance is improving more slowly than processing speed. This growing gap between supply and demand is referred to as the ‘Memory Wall’

  • The performance trend of processors versus memory shows that memory is often the bottleneck

  • What are some of the possible solutions?

    • Change the memory architecture to achieve higher bandwidth. As we saw, conventional memory performance does not seem to increase at a great rate; HMC was introduced around 2011
    • Do computation in or near memory so that we don’t incur significant data movement

Slide 2

  • How is HMC different from the traditional DRAM architecture?
    • Traditional DRAM is a 2D array on a single die, whereas HMC stacks multiple 2D DRAM dies one on top of the other
    • Each cube is divided into sub-sections called vaults.
    • Each vault has its own memory controller through which we can access the data
    • The vaults are connected over a crossbar network
    • This enables higher bandwidth, as accesses to different vaults can proceed in parallel (see the toy sketch below)
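
A toy sketch of why vault-level parallelism helps. The vault count and interleaving granularity below are illustrative assumptions, not values from the HMC spec:

```python
# Toy model of address-to-vault mapping in an HMC-style cube.
# NUM_VAULTS and INTERLEAVE are illustrative assumptions, not spec values.
NUM_VAULTS = 32
INTERLEAVE = 256   # bytes mapped to one vault before moving to the next

def vault_of(addr):
    return (addr // INTERLEAVE) % NUM_VAULTS

# Requests that land in different vaults go through different memory
# controllers and can be serviced concurrently.
addrs = [0x0000, 0x0100, 0x0200, 0x2000]
print([vault_of(a) for a in addrs])   # [0, 1, 2, 0] -> first and last contend
```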

Slide 3

  • Since HMC and HBM provide greater bandwidth than traditional DRAM, people asked whether they could be used as a cache in front of conventional DRAM. This poses several challenges, including:
    • Where to store the tags for such a cache?
      • If the tags are stored in the HMC itself, accessing them adds extra in-memory lookups and is a big hit to performance
      • Some proposals store the tags in SRAM, but this additional storage becomes prohibitive since HMC capacities span several gigabytes (see the rough estimate after this list)
    • Later proposals in this vein partition memory so that frequently used data resides inside the HMC
    • One method proposes loading a page into the HMC on a page fault if it was previously accessed; another proposes moving pages between the two memories based on page reuse
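
A rough back-of-the-envelope estimate of why SRAM tag storage is impractical. The cache parameters below are assumed, illustrative values:

```python
# Rough estimate of SRAM tag overhead for a gigabyte-scale DRAM cache.
# All parameters are illustrative assumptions.
cache_size = 8 * 2**30   # 8 GiB of HMC used as a cache
block_size = 64          # 64 B cache blocks
tag_bits   = 32          # assumed tag + metadata bits per block

num_blocks = cache_size // block_size
tag_bytes  = num_blocks * tag_bits // 8

print(f"blocks: {num_blocks:,}")                 # 134,217,728
print(f"tag SRAM: {tag_bytes / 2**20:.0f} MiB")  # ~512 MiB just for tags
```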

Slide 4 (Tesseract)

  • One class of data-intensive applications is graph processing, which dominates many web applications such as social networks, search, and ranking
  • Graph traversal is essentially a random walk through memory, where memory latency cannot be easily overlapped with computation.
  • Tesseract aims to accelerate graph-based applications such as PageRank by embedding an in-order core inside each vault of the HMC so that graph processing can be parallelized.
  • The challenging part is that while processing a list of graph vertices, we may need to reference the PageRank values of successors that live in other vaults in order to update the rank. How can we still achieve good parallelism?
    • Tesseract supports issuing function calls to other vaults in a non-blocking fashion, following an MPI-style message-passing model to achieve parallelism (sketched after this list).
    • Another optimization is support for prefetching, by embedding hints in the messages.
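
A minimal sketch of the non-blocking, message-passing style of remote updates described above. The class and method names (Vault, remote_add, drain_inbox) and the hash partitioning are illustrative assumptions, not Tesseract's actual interface:

```python
# Sketch of Tesseract-style non-blocking remote function calls between vaults.
from collections import deque

class Vault:
    def __init__(self, vid, num_vaults):
        self.vid = vid
        self.num_vaults = num_vaults
        self.delta = {}        # accumulated PageRank contributions for local vertices
        self.inbox = deque()   # pending remote calls (messages) sent to this vault

    def owner(self, v):
        return v % self.num_vaults   # simple hash partitioning of vertices to vaults

    def remote_add(self, vaults, v, contribution):
        """Non-blocking: enqueue the update at the owning vault and return immediately."""
        vaults[self.owner(v)].inbox.append((v, contribution))

    def drain_inbox(self):
        """Apply queued updates locally; no global synchronization is needed."""
        while self.inbox:
            v, c = self.inbox.popleft()
            self.delta[v] = self.delta.get(v, 0.0) + c

vaults = [Vault(i, 4) for i in range(4)]
vaults[0].remote_add(vaults, v=7, contribution=0.25)   # fire-and-forget to vault 3
for vlt in vaults:
    vlt.drain_inbox()
```

Because the sender never waits for a reply, edges that cross vault boundaries do not stall the local in-order core.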

Slide 4 (IMPICA)

  • Another type of application that involves a random walk through memory is pointer chasing; retrieving the information of interest can involve significant data movement.
  • IMPICA targets linked-list and B-tree traversal.
  • The traditional approach to pointer chasing incurs a lot of compulsory misses, as these data structures do not exhibit good locality (see the sketch after this list).
  • IMPICA also embeds in-order cores that take the traversal request, translate the address, and fetch the data.
  • ……. you go on.
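
An illustrative linked-list traversal showing the access pattern: every hop is a dependent load whose target address is only known after the previous one completes, so caches and prefetchers get little reuse. This is a Python sketch of the pattern, not of the hardware:

```python
# Each step of pointer chasing depends on the previous load, and the nodes are
# scattered, so the traversal behaves like a random walk through memory.
import random

class Node:
    def __init__(self, key, value):
        self.key, self.value, self.next = key, value, None

def build_list(n):
    nodes = [Node(i, i * i) for i in range(n)]
    random.shuffle(nodes)                  # nodes end up scattered "in memory"
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    return nodes[0]

def find(head, key):
    node = head
    while node is not None and node.key != key:
        node = node.next                   # next address known only after this load
    return node

head = build_list(1000)
print(find(head, 42).value)                # 1764
```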

Slide 5 (ApproxPIM)

  • So far we have seen works that add traditional (in-order) processor cores into the memory to perform local processing. Other work observed that this can be difficult because of process limitations and manufacturing cost: a typical CPU design needs 10-12 metal layers, while DRAM uses fewer than 5 to keep cost low.
  • HMC supports a limited set of atomic vector operations per vault, including ADD, SUB, COMPARE, SWAP, AND, NAND, and XOR; however, it lacks MULT and DIV.
  • This paper implements MULT and DIV approximately using SHIFT, ADD, and COMPARE (see the sketch after this list).
  • They evaluate benchmark applications such as BFS, K-means, and KNN.
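
A minimal sketch of one way to approximate a multiply with only SHIFT and ADD, in the spirit of the paper (the exact decomposition ApproxPIM uses may differ): keep only the top few set bits of one operand and add the correspondingly shifted copies of the other:

```python
# Approximate multiply using only SHIFT and ADD: keep the `keep_bits` most
# significant set bits of b, and add shifted copies of a for each of them.
def approx_mult(a, b, keep_bits=2):
    result, used = 0, 0
    for i in reversed(range(b.bit_length())):
        if (b >> i) & 1:
            result += a << i      # one SHIFT plus one ADD per kept bit
            used += 1
            if used == keep_bits:
                break
    return result

print(approx_mult(13, 11))        # exact is 143; top two bits of 11 give 13*10 = 130
```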

…… you go on

Slide 13 (Memory Bandwidth for AI)

  • Another area where acceleration is needed is machine learning.

  • As we know, memory bandwidth is a critical resource, and most machine learning applications are memory bound. This can be assessed from the plot, where the x-axis shows operations/byte and the y-axis shows throughput.

  • We’d want applications to have high operational intensity, but most of them do not.

  • This becomes more pronounced in deeper models such as LSTMs and RNNs. (Next slide)

  • Another challenge for today’s machine learning is power: models are trained on huge GPUs and TPUs but deployed on small embedded targets like watches and hearing aids.

  • What are some ways we can limit data transfer?

    • A good regularizer modifies the model’s objective function and tends to find a model with fewer parameters; hence, we get a smaller model.
    • Prune a DL network
    • Quantize weights/activations so that the overall footprint is reduced; DL applications are robust to quantization noise to some extent.
  • Imagine running a full-fledged inference model on an always-on device, like a cellphone that detects ‘Ok Google’.

    • Ideally we’d want the embedded device to stay in a low-power mode; running full inference there is a drain on energy efficiency
  • To keep the energy footprint low, we try to fit the model in SRAM, but SRAM is usually limited in size. To tackle this, BinarEye stores binary weights. This reduces not only storage but also the load on the ALU, since multiplies can now be just XNOR gates (see the sketch after this slide).

  • BinarEye uses 64 neurons surrounded by 256K of SRAM containing the weights. Each neuron is sub-divided into 4 neurons to make it …
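
A sketch of why binary weights remove the multipliers: with weights and activations constrained to ±1, a dot product reduces to XNOR plus a popcount. The encoding and vector values below are illustrative, not BinarEye's actual datapath:

```python
# Dot product of two ±1 vectors encoded as bits (+1 -> 1, -1 -> 0):
# XNOR counts sign agreements, so dot = 2*matches - length.
def binarize(xs):
    return [1 if x >= 0 else 0 for x in xs]

def xnor_dot(w_bits, a_bits):
    matches = sum(1 for w, a in zip(w_bits, a_bits) if not (w ^ a))   # XNOR + popcount
    return 2 * matches - len(w_bits)

w = [0.5, -1.2, 0.3, -0.7]        # real-valued weights, binarized for storage
a = [1.0, -0.9, -0.4, -2.0]       # activations
print(xnor_dot(binarize(w), binarize(a)))   # 2: signs agree on 3 of 4 positions
```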

DRISA

  • We saw methods that put processor cores in memory, and methods that use memory’s existing logic for computation; this work proposes a new DRAM cell organization that performs computation with the memory arrays themselves (functional sketch below) … you go on.
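
A purely functional sketch of the kind of in-array computation such a design enables, e.g. bitwise operations over entire DRAM rows. This models behavior only; it says nothing about the actual cell or sense-amplifier circuitry:

```python
# Functional model of in-DRAM bitwise row operations: one "activation" operates
# on every bit position of a row (thousands of bits) in parallel, with no data
# movement to the CPU.
def row_and(row_a, row_b):
    return [a & b for a, b in zip(row_a, row_b)]

def row_or(row_a, row_b):
    return [a | b for a, b in zip(row_a, row_b)]

def row_not(row):
    return [1 - a for a in row]

r0 = [1, 0, 1, 1, 0, 1, 0, 0]
r1 = [1, 1, 0, 1, 0, 0, 1, 0]
print(row_and(r0, r1))   # [1, 0, 0, 1, 0, 0, 0, 0]
```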

Morpheus

  • So far we have discussed processing data in memory; can we extend the idea to storage?
  • Storage likes flattened data, as in this example where an Address object is serialized to XML: reading the data back and parsing it into the corresponding objects can take a significant amount of time (see the sketch after this list).
  • Traditional flow involves …. you go on
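
A small illustration of the host-side deserialization cost in the traditional flow: the storage device returns flat XML text and the CPU must parse it back into objects on every read (the Address fields below are made up for the example):

```python
# Traditional flow: storage hands back flat XML; the host CPU re-parses it into
# objects every time the data is read.
import xml.etree.ElementTree as ET

xml_text = """
<Address>
  <street>123 Main St</street>
  <city>Amherst</city>
  <zip>01003</zip>
</Address>
"""

class Address:
    def __init__(self, street, city, zip_code):
        self.street, self.city, self.zip_code = street, city, zip_code

root = ET.fromstring(xml_text)
addr = Address(root.findtext("street"), root.findtext("city"), root.findtext("zip"))
print(addr.city)   # parse-and-construct work repeated on the host for every object
```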

FUTURE WORK

  • In this talk, I’ve discussed several areas where work is being done to address the memory wall problem; through my survey of this work, I have identified areas with significant opportunities for further strides.
  • Machine Learning:
    • Rethink representations in terms of quantization schemes versus model accuracy.
    • Frameworks for quantized training (see the sketch below).
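
One concrete starting point for such a framework would be fake quantization in the forward pass, with gradients passed straight through the rounding step. A minimal, framework-agnostic sketch (the function and its parameters are illustrative assumptions):

```python
# "Fake quantization": round a value onto a uniform n-bit grid but keep it in
# floating point, so training sees the quantization error it must tolerate.
def fake_quantize(x, num_bits=8, x_min=-1.0, x_max=1.0):
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    x_clamped = min(max(x, x_min), x_max)
    q = round((x_clamped - x_min) / scale)
    return x_min + q * scale

print(fake_quantize(0.1234, num_bits=4))   # snaps to the nearest of 16 levels
# During training, gradients are typically passed "straight through" the rounding,
# so the model learns weights that survive quantization at deployment time.
```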