Monday, 24 November 2014

OS basics

system call --->
    1) the o.s provides its services via the system call API
    2) the system call API is implemented using system libraries and the
      system call interface layer of the o.s
    3) the system call API depends on the interrupt handling mechanism of the h/w and the o.s
the system is the computing system, and we can say the kernel is a subsystem of the computing system.

types of subsystems --->
    1) process management
    2) physical and virtual memory subsystem
    3) file and logical file management subsystem
    4) i/o subsystem
    5) device drivers
    6) interrupt subsystem
    7) IPC subsystem
    8) timer subsystem
    9) many more

    subsystems are used to manage the components ( each has a manager with the same name, e.g. interrupt manager ).
types of components --->

    1) s/w components
    2) h/w components

but here the o.s is made up of many s/w components that manage h/w resources and logical resources.

work of s/w components --->

    1) provide the environment ( development and controlled execution ) and functionality
    to the application s/w components
    ( assembler/compiler, system libraries or GUI interface )

    2) initialize the h/w resources  ( device drivers and i/o subsystem )

    3) manage the h/w resources  ( i/o subsystem , file system manager,
    physical/virtual memory manager, cpu scheduler ).

    4) provide interface between user and computing system ( system libraries, system utilities )

s/w components are of 2 types, core and non-core components -->

core components --> used to manage and initialize the h/w and provide services to the non-core
    components, applications and users ( 2 and 3 in the list above ).
    ( core - device drivers, i/o subsystem )

non-core components --> libraries, utilities and execution frameworks that enable the
    user/admin to interface with and utilize the services of the o.s ( 1 and 4 in the list above ).

-->non-core components reside in user space.

-->components are not standalone, i.e. they depend on each other
-->all components co-exist in a layered and hierarchical form.

-->the o.s is described as a layered and modular model
    layered --> components depend on each other and the flow of control / commands / data
        is strict  ( higher to lower level )
    modularity --> features can be added and removed as needed ( modules )

-->some components interact very closely with the hardware controllers; this layer is called the
HAL ( hardware abstraction layer, 1 type of component )
-->the i/o subsystem and device drivers are two components that use the HAL to work with the h/w
-->the HAL is written in machine-level, architecture-specific code, while device drivers are written in
a high-level language.

-->the HAL is different for each particular arch/board, so an o.s will be loaded on a particular
board only if it has the HAL for that arch/board

back to system call --------------->

the system call interface layer invokes the trap handler when a system call API is invoked,
and per the system call convention it then invokes the appropriate system call service routine in the
appropriate subsystem.

the process management subsystem contains IPC, the scheduler and memory management

the scheduler may be taken as a separate subsystem and can interact directly with the HAL layer
( hardware control ) ( ex. context switch )
the topmost layer of memory management is called the process/virtual memory management layer and the lower
one is called the physical memory manager.

the same layering applies to the file subsystem.
device management contains the device drivers and the i/o subsystem


user space and system space are divided as memory regions with certain attributes; the settings
of these attributes are taken care of by system and processor rules.
for user space the ( privileged ) attribute may not be set.

user space and system ( kernel ) space are nothing but the memory regions containing kernel
or user components.

user space includes the non-core components of the o.s and applications.
kernel space includes the core components, holding the kernel data structures.
these memory regions are non-contiguous but grouped together.

-->privileged and non-privileged modes are controlled via the processor's control registers.

-->this division into user and system mode ( less and more privileged ) is enforced by
hardware techniques ( the processor's data / instruction flow is divided into layers,
and in the less privileged mode some layers are blocked;
e.g. for system memory, instructions coming from user mode are blocked at some layer ).

all table-like structures are stored in the kernel data structures ( i.e. in memory )

system interrupt handling --> interrupt management

-->the interrupt subsystem takes care of the interrupt table; it is very architecture
specific and is managed with the help of the HAL.
-->for each interrupt type/number there is a particular handler; the o.s sets up a
 control register of the processor to point to the base address of the interrupt handler table.

-->suppose a h/w timer interrupt arrives: all other h/w interrupts are disabled, the cpu switches
 mode ( less to more privileged ) and the stack is switched ( user to system ).
-->the stack switch is done with the help of h/w; several registers are modified and
 saved onto the system stack.
-->the cpu executes a jump instruction and starts executing the particular handler; when
the handler completes its job it executes a special instruction that reverses
all the steps taken.

-->the cpu returns to the interrupted instruction and resumes execution normally.

--> a trap ( special jump ) is similar to a h/w interrupt, but it is a machine instruction
implemented on top of the interrupt mechanism.


process management

about a process --> it has a pd ( process descriptor ) ( unique, describing several resources
& pieces of information )

    1) process id
    2) process state
    3) scheduling parameters ( policy, time slice and details )
    4) system stack address ( run-time context address )
    ( the system stack is saved at the time of blocking, waiting, or a system call jump )
    5) addresses of the active files of this process
    6) memory descriptor ( virtual and physical memory locations of the process )
    7) fields which help the system put the process in a ready queue ( 1 per processor )
    or a wait queue ( 1 per resource, 1 per i/o channel )
    8) credentials

if the resources a process needs are available, the process is in a ready queue; if not, in a wait queue.

when the process is running on the processor it can be preempted by a higher-priority process or because
its time slice is exhausted; in case of preemption it is added to the ready queue, not a wait queue.

process creation -->
    1) a new pd is created in the pd table.
    2) a new pid is given to the process.
    3) resources and objects are allocated and added to the pd.
    4) the pd is added to 1 or more lists of the process management subsystem ( ipc,
    memory management, scheduler ).

the resources of the process remain until termination.

abnormal termination -->
-->if a process attempts an illegal activity ( as noted, the processor is divided into layers to
execute privileged instructions ), the processor generates a h/w exception,
like an interrupt, handled by an exception handler stored alongside the trap and interrupt handlers.

-->an exception is also generated when bypassing the system call API and trying to access core services of the o.s directly.

when the process terminates the scheduler is invoked, either via the termination
system call API or by the exception handler.

the scheduler may also be invoked because the time slice is exhausted.


system call API and system call service routine -->

system call APIs are implemented in user-space system libraries, but the system call service routines are implemented in different components of the o.s.

they interact through a one-to-one mapping.

the system call API is needed to invoke the system call service routine.

the system call API is implemented using the interrupt ( trap ) mechanism.

use of system call API --->
    1) process creation and termination
    2) thread creation and termination
    3) i/o access
    4) memory allocation and deallocation
    5) synchronization
    6) data exchange
    7) file input/output access
    8) many more 


when a system call API is invoked there is a particular entry in the system call library.
each system call library stub is coded in assembly language and executes the trap instruction;
when executed it causes an interrupt, resulting in a switch from user
to kernel mode.

a process calls the library stub as it calls any function, creating a stack frame for it. when the trap
instruction is executed, the registers are saved, the mode changes and the kernel stack is set up.


buffer allocation ----> for figures see the MJB book ( Maurice J. Bach, The Design of the UNIX Operating System )

--> the buffer pool is part of the kernel data structures and is initialized at system start;
the buffer pool size depends on the memory size and system performance.

--> every disk is detected logically by the kernel,
i.e. every filesystem ( unit number ) is translated into a logical device number by the disk driver;
that device number is used by the kernel for every file system access

suppose we have 2 file systems, ext2 and ext3:
the disk driver differentiates those drives by logical device number ( like drive C and drive D in windows ).
each logical device has blocks addressed by block number

in the buffer pool, if we have 2 file systems then we have 2 buffer lists ( hash queues ) and free lists.
the buffer pool is only 1, but there may be more buffer lists

a buffer consists of a buffer header and a data array pointer.

we have hash queues and a free buffer list ( both are circular doubly linked lists )

--->when a disk block is requested, the process uses the getblk algorithm and searches the buffer cache;
if it is there the buffer is returned immediately, otherwise the kernel calls the disk driver to schedule a
read request and goes to sleep on the event of I/O completion. the disk driver notifies the disk controller
hardware that it wants to read data, and the disk controller transfers the data into the buffer.

free list case --> ordered by recent use:
a recently freed buffer is put at the end ( tail ) of the list, or at the head of the list in case of error
( I/O access error or hardware access problem; in these cases the disk controller does not inform the CPU
with an interrupt and the buffer is returned to the free list ), never in the middle.

1 disk block is allocated to only one particular buffer at a time.

putting a buffer on a hash queue works the same way.
the hash queue is selected using a combination of the device number and block number, so
to put a disk block into memory we search according to the ( device number, block number )
hash value in the hash queues; if the location is found we put it there, otherwise we make a location there.

we get the block detail ( whether the block's location is in a hash queue or not ) from the hash queue, and
for that block we have to find a buffer from the free list.

if /dev/sdb2 and /dev/sdb3 are there, then 2 buffer lists ( hash queues ) and 2 free lists are there.

a buffer may be on a hash queue and on the free list at the same time.

buffer allocation happens only in system call context, not from user-space access

when a disk block is accessed, the kernel first checks the hash queue, then the free list

read and write access -->
higher-level modules ( the file subsystem ) take care of fast access, so they may fetch not 1 but
2 disk blocks from the disk at the same time ( the read-ahead algorithm ).
the 1st read is synchronous: the disk driver reads the data block into a buffer ( the process sleeps
for I/O completion on the 1st block ) but does not wait for the 2nd. when the 2nd completes, the disk
controller interrupts the CPU; since the second was asynchronous, the buffer is released ( for use by
another process ), as in the delayed-write case.
the same applies to the write operation

asynchronous and delayed write -->
with an asynchronous write the kernel does not wait for I/O completion; with a delayed write it defers the
write as long as possible and marks the buffer as old, so it is added to the head, not the tail.

at the time of access, if the kernel requests the second buffer, it first checks the asynchronous
read-ahead buffer and provides that buffer for the block location.

we have 5 scenarios for buffer allocation

1) suppose i access a disk block from drive C ( logically; /dev/sdb2 in unix )

suppose i requested the 7th block of device /dev/sdb2, so the kernel's high-level algorithm computes the
hash of ( /dev/sdb2, block 7 ) and searches for it in the hash table of the particular f.s.

suppose it finds the block's location in the hash queue ( 7th block ) and its buffer is free:
then it allocates that buffer at that particular location and manipulates data in that buffer.

applicable to all scenarios --->
the buffer is marked busy ( otherwise another process could access that buffer as free ).
after completion the buffer is released, so any other process waiting for that buffer, or a process
waiting because there was no free buffer, wakes up and takes it.

2) again suppose i request the 7th block of device /dev/sdb2; the kernel's high-level algorithm computes
the hash of ( /dev/sdb2, block 7 ) and searches for it in the hash table of the particular f.s.

it does not find a location/buffer for that block, so it takes a buffer from the free list and makes
a location for that block's allocated buffer in the hash queue.

3) the block's location is not there ( scenario 2 ), so it gets a buffer from the free list, but the
buffer obtained is marked "delayed write": the kernel starts an asynchronous write of the delayed-write
buffer to disk, leaves that buffer for others, and searches for another free buffer in the free list. if
that is again a delayed write, the same process repeats.

4) the block's location is not there and the free list is also empty:
the process sleeps until any buffer becomes free.

5) the block's location is there but the buffer is busy:
that means somebody is using that disk block ( as noted, 1 block == 1 buffer, otherwise data would be
corrupted ), so the process waits for that buffer to become free;
it does not assign a different buffer to that block.
here a race condition for the freed buffer can occur --->
process A is using the buffer and process B wants it, so process B goes to sleep;
but when process A completes its task, process C, which was waiting for any free buffer, grabs it and
may reassign it to a different ( device, block ). so when process B wakes up it re-checks, finds the
buffer no longer holds its block, and takes a new buffer from the free list.

Find which boot loader you are using

sudo dd if=/dev/sda bs=512 count=1 2>&1 | grep GRUB - For GRUB
sudo dd if=/dev/sda bs=512 count=1 2>&1 | grep LILO - For LILO
sudo dd if=/dev/sda bs=512 count=1 2>&1 | grep ACRON - For Acronis
sudo dd if=/dev/sda bs=512 count=1 2>&1 | grep RED - For RedBoot

If it matches, the output will look like:
Binary file (standard input) matches

Friday, 27 June 2014

How to write a PERL module in a simple and easy way?

Below is my code -

Package name - Foo

yatendra@yatendra:~/Research$ cat Foo.pm
package Foo;

sub bar {
  print "Hello $_[0]\n";
}

sub blat {
  print "World $_[0]\n";
}

1;

File name -

require Foo;

Foo::bar( "a" );
Foo::blat( "b" );

Create the Perl Module Tree
  h2xs -AX -n Foo

What are Packages?

  • A package is a collection of code which lives in its own namespace
  • A namespace is a named collection of unique variable names (also called a symbol table).
  • Namespaces prevent variable name collisions between packages
  • Packages enable the construction of modules which, when used, won't clobber variables and functions outside of the module's own namespace

The Package Statement

  • package statement switches the current naming context to a specified namespace (symbol table)
  • If the named package does not exist, a new namespace is first created.

$i = 1; print "$i\n"; # Prints "1"
package foo;
$i = 2; print "$i\n"; # Prints "2"
package main;
print "$i\n"; # Prints "1"
  • The package stays in effect until either another package statement is invoked, or until the end of the current block or file.
  • You can explicitly refer to variables within a package using the :: package qualifier


For Example:
$i = 1; print "$i\n"; # Prints "1"
package foo;
$i = 2; print "$i\n"; # Prints "2"
package main;
print "$i\n"; # Prints "1"

print "$foo::i\n"; # Prints "2"

BEGIN and END Blocks

You may define any number of code blocks named BEGIN and END which act as constructors and destructors respectively.
BEGIN { ... }
END { ... }
  • Every BEGIN block is executed as soon as it is compiled, before the rest of the script is parsed and before any other statement is executed
  • Every END block is executed just before the perl interpreter exits.
  • The BEGIN and END blocks are particularly useful when creating Perl modules.

What are Perl Modules?

A Perl module is a reusable package defined in a library file whose name is the same as the name of the package (with a .pm on the end).
A Perl module file called "Foo.pm" might contain statements like this.

package Foo;

sub bar {
   print "Hello $_[0]\n";
}

sub blat {
   print "World $_[0]\n";
}

1;
A few notable points about modules
  • The functions require and use will load a module.
  • Both use the list of search paths in @INC to find the module (you may modify it!)
  • Both call the eval function to process the code
  • The 1; at the bottom causes eval to evaluate to TRUE (and thus not fail)

The Require Function

A module can be loaded by calling the require function

require Foo;

Foo::bar( "a" );
Foo::blat( "b" );
Notice above that the subroutine names must be fully qualified (because they are isolated in their own package)
It would be nice to enable the functions bar and blat to be imported into our own namespace so we wouldn't have to use the Foo:: qualifier.

The Use Function

A module can be loaded by calling the use function

use Foo;

bar( "a" );
blat( "b" );
Notice that we didn't have to fully qualify the package's function names?
The use function will export a list of symbols from a module given a few added statements inside a module
require Exporter;
@ISA = qw(Exporter);
Then, provide a list of symbols (scalars, lists, hashes, subroutines, etc.) by filling the list variable named @EXPORT. For example:
package Module;

require Exporter;
@ISA = qw(Exporter);
@EXPORT = qw(bar blat);

sub bar { print "Hello $_[0]\n" }
sub blat { print "World $_[0]\n" }
sub splat { print "Not $_[0]\n" }  # Not exported!


Create the Perl Module Tree

When you are ready to ship your PERL module, there is a standard way of creating a Perl module tree. This is done using the h2xs utility, which comes along with PERL. Here is the syntax to use h2xs:
$ h2xs -AX -n ModuleName

# For example, if your module is available in the file Person.pm
$ h2xs -AX -n Person

This will produce following result
Writing Person/lib/Person.pm
Writing Person/Makefile.PL
Writing Person/README
Writing Person/t/Person.t
Writing Person/Changes
Writing Person/MANIFEST
Here is the description of these options
  • -A omits the Autoloader code (best used by modules that define a large number of infrequently used subroutines)
  • -X omits XS elements (eXternal Subroutine, where eXternal means external to Perl, i.e. C)
  • -n specifies the name of the module
So the above command creates the following structure inside the Person directory (the actual output is shown above).
  • Changes
  • Makefile.PL
  • MANIFEST (contains the list of all files in the package)
  • t/ (test files)
  • lib/ ( actual source code goes here, in Person.pm )
So finally you tar this directory structure into a file Person.tar and you can ship it. You would have to update the README file with proper instructions. You can provide some example test files in the t directory.

Installing Perl Module

Installing a Perl module is very easy. Use the following sequence to install any Perl module.
perl Makefile.PL
make
make test
make install

Simple steps to automate Android Application test cases using UIAutomator

Advantages:
  • Can be used on device displays with different resolutions
  • Events can be linked with Android UI controls. For example, click on a button with text that says “Ok,” instead of clicking a coordinate location (x=450, y=550).
  • Can reproduce a complex sequence of user actions
  • Always performs the same sequence of actions, allowing us to collect performance metrics on different devices
  • Can run several times and on different devices without changing any Java* code
  • Can use hardware buttons on devices
Limitations:
  • Hard to use with OpenGL* and HTML5 applications because these apps have no Android UI components
  • Time consuming to write JavaScript*

UI Automator is an Android test framework used for functional, black-box testing. It is easy to understand and execute, and it's open source. UI Automator identifies UI elements using their text, description, index, class names etc. UI Automator can be used to write test cases for any Android app, either default or third-party. It doesn't require the tester to change the signature of the apps, so it can be used for native apps like Settings.apk.
UI Automator is also a JUnit-based test suite, so we can use the Assert class to get the specific test case pass/fail reason.
The biggest limitation of the UI Automator framework is that it only works on API level 16 or higher (Android 4.1 or later).

      Set Up for the UI Automator:
1) Android application apk file for Testing. Ex: ApplicationToTest.apk
2) Eclipse for building Test project
3) ADT (Android Development Tools)
4) SDK (Software Development Kit)
If you are using Eclipse for just your Android development, then download Eclipse;
this includes the SDK for Android also.
5) JDK (Java Development Kit)
6) Go to the location where your SDK is present and check for android.jar and uiautomator.jar. They should be present under sdk > platforms > (any folder with) API level 16 or higher.

Creating the Test Project:

1. Create a new Java project in Eclipse, and give your project a name that is relevant to the tests you’re about to create (for example, "MyAppNameTests"). In the project, you will create the test cases that are specific to the application that you want to test.
2. From the Project Explorer, right-click on the new project that you created, then select Properties > Java Build Path, and do the following:
Click Add Library > JUnit then select JUnit3 to add JUnit support.
Click Add External JARs... and navigate to the SDK directory. Under the platforms directory, select the latest SDK version and add both the uiautomator.jar and android.jar files.
  1. Go to the project, create a package name as per standards, and create a class file.
  2. Create New android application project.
And copy-paste the below code (the messaging package name and the send-to number here are placeholders; adjust them for your device):
package com.yatendra;

import android.widget.TextView;
import com.android.uiautomator.core.UiObject;
import com.android.uiautomator.core.UiObjectNotFoundException;
import com.android.uiautomator.core.UiScrollable;
import com.android.uiautomator.core.UiSelector;
import com.android.uiautomator.testrunner.UiAutomatorTestCase;

public class writeMsg extends UiAutomatorTestCase {

    public void testDemo2() throws UiObjectNotFoundException {
        // get a uiDevice instance and press the Home button
        getUiDevice().pressHome();
        // this will create an instance of the "Apps" button on the home screen
        UiObject allAppsButton = new UiObject(new UiSelector().description("Apps"));
        // this will click on Apps and wait for the new window to open
        allAppsButton.clickAndWaitForNewWindow();
        // this will create a ui object of the Apps tab using the string "Apps"
        UiObject appsTab = new UiObject(new UiSelector().text("Apps"));
        // this will click on the tab
        appsTab.click();
        // this will create a scroller object which we can use to perform some actions
        UiScrollable appViews = new UiScrollable(new UiSelector().scrollable(true));
        // now it scrolls horizontally
        appViews.setAsHorizontalList();
        // now find the app you want to click using its text, i.e. "Messaging", and create an instance of it
        UiObject msgApp = appViews.getChildByText(
                new UiSelector().className(TextView.class.getName()), "Messaging");
        // now click on the object found using the name
        msgApp.clickAndWaitForNewWindow();
        // now it will open a new activity; identify it using its package name (assumed here)
        UiObject msgUI = new UiObject(new UiSelector().packageName("com.android.mms"));
        UiObject ui = msgUI.getChild(new UiSelector()
                .className("android.widget.TextView").descriptionContains("New message"));
        // check if the window is present, or fail with an assert
        assertTrue("Unable to find the message Ui", ui.exists());
        ui.clickAndWaitForNewWindow();
        // type the send-to number (placeholder)
        UiObject sendTo = new UiObject(new UiSelector()
                .className("android.widget.MultiAutoCompleteTextView"));
        sendTo.setText("1234567890");
        // here I am handling a popup that sometimes appears
        UiObject addToPop = new UiObject(new UiSelector()
                .className("android.widget.EditText").textContains("Add to People"));
        if (addToPop.exists()) {
            getUiDevice().pressBack();   // dismiss the popup
        }
        // find the message text body and write the message
        UiObject sendTomsg = new UiObject(new UiSelector()
                .className("android.widget.EditText"));
        sendTomsg.setText("Hello this is to test Messaging apk");
    }
}

Executing and Creating the Test automation JAR:
  1. Using Terminal enter the location where your automation project is present.

Path/yatendra$ android create uitest-project -n yatendra -t android-17 -p /Path/yatendra

  • Here android-17 is the folder under sdk/platforms where the uiautomator and android JARs are present
    • The first yatendra is my project name
    • This will create the build file, build.xml
Path/yatendra$ ant build
    • This will create the JAR file.
    • Make sure the ant path is also set properly.
    • If ant is not installed: sudo apt-get install ant1.8
    • Or sudo apt-get update and then sudo apt-get install ant
    • If ant build shows an error like the one below, use update-alternatives --config java
      /home/yatendra/adt-bundle-linux-x86_64-20131030/sdk/tools/ant/uibuild.xml:183: java.lang.UnsupportedClassVersionError: com/sun/tools/javac/Main : Unsupported major.minor version 51.0
    • If ant shows an error like the one below, use sudo apt-get install ant-optional,
      or check the sdk path and whether u have installed for 32-bit or 64-bit.

      Buildfile: build.xml
      /home/yatendra/Workspace_android/testAPK/build.xml:90: The following error occurred while executing this line:
      /home/yatendra/adt-bundle-linux-x86_64-20131030/sdk/tools/ant/uibuild.xml:105: No supported regular expression matcher found: java.lang.ClassNotFoundException:
      Total time: 0 seconds

  2. Now we need to push the jar file created by the above command to the device under test.
Path/yatendra$ adb push bin/yatendra.jar /data/local/tmp
  • We are pushing yatendra.jar to /data/local/tmp

  3. Now to run the particular test case we need to execute one more command:
Path/yatendra$ adb shell uiautomator runtest yatendra.jar -c com.yatendra.writeMsg
  • Here com.yatendra.writeMsg is the particular testcase we are running.
  • com.yatendra is the package name and writeMsg is the class name

Friday, 23 May 2014

Simple steps to automate test cases using robotium

  1. Create one workspace or Directory named as “Workspace_Robotium”
  2. Start eclipse.
  3. Choose workspace.
  4. New -> Project -> New Project -> Android application project
  5. Application Name -> “Testing”
  6. Project Name -> “Testing_Robotium”
  7. Next ...
  8. New -> Android Test Project
    Project Name -> “Testing”
    Next . Select “this project”
  9. Copy/ Take Apk to test
  10. Open as Archive
  11. Delete META-INF file
  12. Change the sign and install the Apk
    1. Copy ~/.android/debug.keystore to apk containing directory.
    2. jarsigner -verbose -keystore debug.keystore ApkUnderTest.apk androiddebugkey
    3. jarsigner -verify ApkUnderTEst.apk
    4. Create tmp Folder
    5. zipalign -v 4 ApkUnderTest.apk tmp\ApkUnderTest.apk
    6. adb install tmp/ApkUnderTest.apk
  13. Android Test project
    1. Build Path-> android Build path
  14. Android test Project
    1. Change Activity launch name
    2. Change Android Manifest .xml file \
    3. <instrumentation
    This you will get it from Logcat.
  15. Make a class and Paste the code.
package com.example.testing_robotium.test;

import android.test.ActivityInstrumentationTestCase2;
import com.robotium.solo.Solo;

public class Test_sensor_simple extends ActivityInstrumentationTestCase2 {

// this is the name of the Activity u are launching, i.e. the main activity of the android App
// u have to place the name of the main activity of the application under test .

private static final String LAUNCHER_ACTIVITY_FULL_CLASSNAME = "";
// this will check if the Activity u have mentioned is present on the Device
private static Class launcherActivityClass;
static {
    try {
        launcherActivityClass = Class.forName(LAUNCHER_ACTIVITY_FULL_CLASSNAME);
    } catch (ClassNotFoundException e) {
        throw new RuntimeException(e);
    }
}

public Test_sensor_simple() throws ClassNotFoundException {
    super(launcherActivityClass);
}

// initialize the Robot class
private Solo solo;

// this is to start the activity which needs to be tested
protected void setUp() throws Exception {
    super.setUp();
    solo = new Solo(getInstrumentation(), getActivity());
}

// this is the test case
public void testDisplayBlackBox1() {
    solo.clickOnText("Game", 3);
}

// this is to close the Activity after the test cases are completed
public void tearDown() throws Exception {
    solo.finishOpenedActivities();
    super.tearDown();
}
}

Monday, 10 February 2014

NUMA (Non-Uniform Memory Access): An Overview

NUMA is becoming more common because memory controllers are getting closer to execution units on microprocessors.

NUMA (non-uniform memory access) is the phenomenon that memory at various points in the address space of a processor have different performance characteristics. At current processor speeds, the signal path length from the processor to memory plays a significant role. Increased signal path length not only increases latency to memory but also quickly becomes a throughput bottleneck if the signal path is shared by multiple processors. The performance differences to memory were noticeable first on large-scale systems where data paths were spanning motherboards or chassis. These systems required modified operating-system kernels with NUMA support that explicitly understood the topological properties of the system's memory (such as the chassis in which a region of memory was located) in order to avoid excessively long signal path lengths. (Altix and UV, SGI's large address space systems, are examples. The designers of these products had to modify the Linux kernel to support NUMA; in these machines, processors in multiple chassis are linked via a proprietary interconnect called NUMALINK.)
Today, processors are so fast that they usually require memory to be directly attached to the socket that they are on. A memory access from one socket to memory from another has additional latency overhead to accessing local memory—it requires the traversal of the memory interconnect first. On the other hand, accesses from a single processor to local memory not only have lower latency compared to remote memory accesses but do not cause contention on the interconnect and the remote memory controllers. It is good to avoid remote memory accesses. Proper placement of data will increase the overall bandwidth and improve the latency to memory.
As the trend toward improving system performance by bringing memory nearer to processor cores continues, NUMA will play an increasingly important role in system performance. Modern processors have multiple memory ports, and the latency of access to memory varies depending even on the position of the core on the die relative to the controller. Future generations of processors will have increasing differences in performance as more cores on chip necessitate more sophisticated caching. As the access properties of these different kinds of memory continue to diverge, operating systems may need new functionality to provide good performance.
NUMA systems today (2013) are mostly encountered on multisocket systems. A typical high-end business-class server today comes with two sockets and will therefore have two NUMA nodes. Latency for a memory access (random access) is about 100 ns. Access to memory on a remote node adds another 50 percent to that number.
Performance-sensitive applications can require complex logic to handle memory with diverging performance characteristics. If a developer needs explicit control of the placement of memory for performance reasons, some operating systems provide APIs for this (for example, Linux, Solaris, and Microsoft Windows provide system calls for NUMA). However, various heuristics have been developed in the operating systems that manage memory access to allow applications to transparently utilize the NUMA characteristics of the underlying hardware.
A NUMA system classifies memory into NUMA nodes (which Solaris calls locality groups). All memory available in one node has the same access characteristics for a particular processor. Nodes have an affinity to processors and to devices. These are the devices that can use memory on a NUMA node with the best performance since they are locally attached. Memory is called node local if it was allocated from the NUMA node that is best for the processor. For example, the NUMA system exhibited in Figure 1 has one node belonging to each socket, with four cores each.
Figure 1. A System with Two NUMA Nodes and Eight Processors

The process of assigning memory from the NUMA nodes available in the system is called NUMA placement. As placement influences only performance and not the correctness of the code, heuristic approaches can yield acceptable performance. In the special case of noncache-coherent NUMA systems, this may not be true since writes may not arrive in the proper sequence in memory. However, there are multiple challenges in coding for noncache-coherent NUMA systems. We restrict ourselves here to the common cache-coherent NUMA systems.
The focus in these discussions will be mostly on Linux since this operating system has refined NUMA facilities and is widely used in performance-critical environments today. The author was involved with the creation of the NUMA facilities in Linux and is most familiar with those.
Solaris has somewhat comparable features, but the number of systems deployed is orders of magnitude smaller. Work is under way to add support to other Unix-like operating systems, but that support so far has been mostly confined to operating-system tuning parameters for placing memory accesses. Microsoft Windows also has a developed NUMA subsystem that allows placing memory structures effectively, but the software is used mostly for enterprise applications rather than high-performance computing. Memory-access speed requirements for enterprise-class applications are frequently more relaxed than in high-performance computing, so less effort is spent on NUMA memory handling in Windows than in Linux.

How Operating Systems Handle Numa Memory

There are several broad categories in which modern production operating systems allow for the management of NUMA: accepting the performance mismatch, hardware memory striping, heuristic memory placement, static NUMA configurations, and application-controlled NUMA placement.

Ignore The Difference

Since NUMA placement is a best-effort approach, one option is simply to ignore the possible performance benefit and just treat all memory as if no performance differences exist. This means that the operating system is not aware of memory nodes. The system is functional, but performance varies depending on how memory happens to be allocated. The smaller the differences between local and remote accesses, the more viable this option becomes.
This approach allows software and the operating system to run unmodified. Frequently, this is the initial approach for system software when systems with NUMA characteristics are first used. The performance will not be optimal and will likely be different each time the machine and/or application runs, because the allocation of memory to performance-critical segments varies depending on the system configuration and timing effects on boot-up.

Memory Striping In Hardware

Some machines can set up the mapping from memory addresses to the cache lines in the nodes in such a way that consecutive cache lines in an address space are taken from different memory controllers (interleaving at the cache-line level). As a result, the NUMA effects are averaged out (since structures larger than a cache line will then use cache lines on multiple NUMA nodes). Overall system performance is more deterministic compared with the approach of just ignoring the difference, and the operating system still does not need to know about the difference in memory performance, meaning no NUMA support is needed in the operating system. The danger of overloading a node is reduced since the accesses are spread out among all available NUMA nodes.
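The cache-line interleaving just described can be sketched as a simple address-to-node mapping. This is an illustrative model assuming 64-byte cache lines and a plain modulo scheme; real hardware may hash address bits instead:

```python
CACHE_LINE = 64  # bytes; a typical cache-line size

def home_node(addr, nodes=2):
    """Node that owns the cache line containing this address, under
    simple cache-line interleaving (illustrative modulo scheme)."""
    return (addr // CACHE_LINE) % nodes

# Consecutive cache lines alternate between the two nodes, so any
# structure larger than one cache line spans both memory controllers:
print([home_node(a) for a in range(0, 256, CACHE_LINE)])  # [0, 1, 0, 1]
```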
The drawback is that the interconnect is in constant use. Performance will never be optimal since the striping means that cache lines are frequently accessed from remote NUMA nodes.

Heuristic Memory Placement For Applications

If the operating system is NUMA-aware (under Linux, NUMA must be enabled at compile time and the BIOS or firmware must provide NUMA memory information for the NUMA capabilities to become active; NUMA can be disabled and controlled at runtime with a kernel parameter), it is useful to have measures that allow applications to allocate memory in ways that minimize signal path length so performance is increased. The operating system has to adopt a policy that maximizes performance for as many applications as possible. Most applications run with improved performance using the heuristic approach, especially compared with the approaches discussed earlier. A NUMA-aware operating system determines memory characteristics from the firmware and can therefore tune its own internal operations to the memory configuration. Such tuning requires coding effort, however, so only performance-critical portions of the operating system tend to get optimized for NUMA affinities, whereas less-performance-critical components tend to operate on the assumption that all memory is equal.
The most common assumptions made by the operating system are that the application will run on the local node and that memory from the local node is to be preferred. If possible, all memory requested by a process will be allocated from the local node, thereby avoiding the use of the cross-connect. The approach does not work, though, if the number of required processors is higher than the number of hardware contexts available on a socket (when processors on both NUMA nodes must be used); if the application uses more memory than available on a node; or if the application programmer or the scheduler decides to move application threads to processors on a different socket after memory allocation has occurred.
In general, small Unix tools and small applications work very well with this approach. Large applications that make use of a significant percentage of total system memory and of a majority of the processors on the system will often benefit from explicit tuning or software modifications that take advantage of NUMA.
Most Unix-style operating systems support this mode of operation. Notably, FreeBSD and Solaris have optimizations to place memory structures to avoid bottlenecks. FreeBSD can place memory round-robin on multiple nodes so that the latencies average out. This allows FreeBSD to work better on systems that cannot do cache-line interleaving on the BIOS or hardware level (additional NUMA support is planned for FreeBSD 10). Solaris also replicates important kernel data structures per locality group.

Special Numa Configuration For Applications

The operating system provides configuration options that allow the operator to tell the operating system that an application should not be run with the default assumptions regarding memory placement. It is possible to establish memory-allocation policies for an application without modifying code.
Command-line tools exist under Linux that can set up policies to determine memory affinities (taskset, numactl). Solaris has tunable parameters for how the operating system allocates memory from locality groups. These are roughly comparable to Linux's process memory-allocation policies.

Application Control Of Numa Allocations

The application may want fine-grained control of how the operating system handles allocation for each of its memory segments. For that purpose, system calls exist that allow the application to specify which memory region should use which policies for memory allocations.
The main performance issues typically involve large structures that are accessed frequently by the threads of the application from all memory nodes and that often contain information that needs to be shared among all threads. These are best placed using interleaving so that the objects are distributed over all available nodes.

How Does Linux Handle Numa?

Linux manages memory in zones. In a non-NUMA Linux system, zones are used to describe memory ranges required to support devices that are not able to perform DMA (direct memory access) to all memory locations. Zones are also used to mark memory for other special needs such as movable memory or memory that requires explicit mappings for access by the kernel (HIGHMEM), but that is not relevant to the discussion here. When NUMA is enabled, more memory zones are created and they are also associated with NUMA nodes. A NUMA node can have multiple zones since it may be able to serve multiple DMA areas. How Linux has arranged memory can be determined by looking at /proc/zoneinfo. The NUMA node association of the zones allows the kernel to make decisions involving the memory latency relative to cores.
On boot-up, Linux will detect the organization of memory via the ACPI (Advanced Configuration and Power Interface) tables provided by the firmware and then create zones that map to the NUMA nodes and DMA areas as needed. Memory allocation then occurs from the zones. Should memory in one zone become exhausted, then memory reclaim occurs: the system will scan through the least recently used pages trying to free a certain number of pages. Counters that show the current status of memory in various nodes/zones can also be seen in /proc/zoneinfo. Figure 2 shows the types of memory in a zone/node.

Figure 2. Types of Memory in a Zone/Node: free memory; unmapped page cache (e.g., cached disk contents); pages mapped to processes (e.g., text segments, mmapped files); anonymous pages (e.g., stack, heap); dirty or writeback pages (e.g., disk I/O); unevictable pages (e.g., mlocked memory); kernel, driver, and unreclaimable slab memory.

Memory Policies

How memory is allocated under NUMA is determined by a memory policy. Policies can be specified for memory ranges in a process's address space, or for a process or the system as a whole. Policies for a process override the system policy, and policies for a specific memory range override a process's policy.
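The override order just described (range policy beats process policy beats system policy) amounts to a fallback chain. The function and policy names below are illustrative, not a kernel API:

```python
def effective_policy(range_policy=None, process_policy=None,
                     system_policy="NODE LOCAL"):
    """Most specific policy wins: memory range, then process, then system."""
    return range_policy or process_policy or system_policy

print(effective_policy())                             # NODE LOCAL
print(effective_policy(process_policy="INTERLEAVE"))  # INTERLEAVE
print(effective_policy("BIND", "INTERLEAVE"))         # BIND (range wins)
```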
The most important memory policies are:
NODE LOCAL. The allocation occurs from the memory node local to where the code is currently executing.
INTERLEAVE. Allocation occurs round-robin. First a page will be allocated from node 0, then from node 1, then again from node 0, etc. Interleaving is used to distribute memory accesses for structures that may be accessed from multiple processors in the system in order to have an even load on the interconnect and the memory of each node.
There are other memory policies that are used in special situations, which are not mentioned here for brevity's sake. The two policies just mentioned are generally the most useful and the operating system uses them by default. NODE LOCAL is the default allocation policy if the system is up and running.
The Linux kernel will use the INTERLEAVE policy by default on boot-up. Kernel structures created during bootstrap are distributed over all the available nodes in order to avoid putting excessive load on a single memory node when processes require access to the operating-system structures. The system default policy is changed to NODE LOCAL when the first userspace process (init daemon) is started.
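The INTERLEAVE policy described above can be modeled as a round-robin page allocator. This is a sketch of the behavior, not kernel code:

```python
from collections import Counter
from itertools import cycle

def interleave(num_pages, nodes=(0, 1)):
    """Assign pages to nodes round-robin, as the INTERLEAVE policy does."""
    node_iter = cycle(nodes)
    return [next(node_iter) for _ in range(num_pages)]

placement = interleave(5)
print(placement)           # [0, 1, 0, 1, 0]
print(Counter(placement))  # load is spread nearly evenly across nodes
```

Because successive pages land on alternating nodes, accesses to a large shared structure exercise both memory controllers rather than overloading one.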
The active memory allocation policies for all memory segments of a process (and information that shows how much memory was actually allocated from which node) can be seen by determining the process id and then looking at the contents of /proc/<pid>/numa_maps.

Basic Operations On Process Startup

Processes inherit their memory policy from their parent. Most of the time the policy is left at the default, which means NODE LOCAL. When a process is started on a processor, memory is allocated for that process from the local NUMA node. All other allocations of the process (through growing the heap, page faults, mmap, and so on) will also be satisfied from the local NUMA node.
The Linux scheduler will attempt to keep the process cache hot during load balancing. This means the scheduler prefers to leave the process on processors that share the L1 cache with the processor the process last ran on, then on processors that share L2, and then on processors that share L3. If there is an imbalance beyond that, the scheduler will move the process to any other processor on the same NUMA node.
As a last resort the scheduler will move the process to another NUMA node. At that point the code will be executing on the processor of one node, while the memory allocated before the move has been allocated on the old node. Most memory accesses from the process will then be remote, which will cause the performance of the process to degrade.
There has been some recent work in making the scheduler NUMA-aware to ensure that the pages of a process can be moved back to the local node, but that work is available only in Linux 3.8 and later, and is not considered mature. Further information on the state of affairs may be found on the Linux kernel mailing lists and in articles on Linux Weekly News.

Reclaim

Linux typically allocates all available memory in order to cache data that may be used again later. When memory begins to be low, reclaim will be used to find pages that are either not in use or unlikely to be used soon. The effort required to evict a page from memory and to get the page back if needed varies by type of page. Linux prefers to evict pages from disk that are not mapped into any process space because it is easy to drop all references to the page. The page can be reread from disk if it is needed later. Pages that are mapped into a process's address space require that the page first be removed from that address space before the page can be reused. A page that is not a copy of a page from disk (anonymous pages) can be evicted only if the page is first written out to swap space (an expensive operation). There are also pages that cannot be evicted at all, such as mlocked() memory or pages in use for kernel data.
The impact of reclaim on the system can therefore vary. In a NUMA system multiple types of memory will be allocated on each node. The amount of free space on each node will vary. So if there is a request for memory and using memory on the local node would require reclaim but another node has enough memory to satisfy the request without reclaim, the kernel has two choices:
• Run a reclaim pass on the local node (causing kernel processing overhead) and then allocate node- local memory to the process.
• Just allocate from another node that does not need a reclaim pass. Memory will not be node local, but we avoid frequent reclaim passes. Reclaim will be performed when all zones are low on free memory. This approach reduces the frequency of reclaim and allows more of the reclaim work to be done in a single pass.
For small NUMA systems (such as the typical two-node servers) the kernel defaults to the second approach. For larger NUMA systems (four or more nodes) the kernel will perform a reclaim in order to get node-local memory whenever possible because the latencies have higher impacts on process performance.
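The two choices above can be pictured as a small decision function. The structure below is an illustrative model of the trade-off, not actual kernel code:

```python
def pick_node(free_pages, wanted, local, zone_reclaim_mode):
    """Pick a node for an allocation of `wanted` pages.

    free_pages: dict mapping node id -> free pages on that node
    local: the node the requesting process is running on
    zone_reclaim_mode: 1 = reclaim locally first, 0 = fall back to other nodes
    Returns (node, needs_reclaim). Illustrative model only.
    """
    if free_pages[local] >= wanted:
        return local, False               # node-local memory is available
    if zone_reclaim_mode:
        return local, True                # run a reclaim pass on the local node
    for node, free in free_pages.items():  # otherwise take any node with room
        if free >= wanted:
            return node, False
    return local, True                    # everything is low: reclaim anyway

# Two-node example: local node 0 is low, node 1 has room.
print(pick_node({0: 10, 1: 500}, wanted=100, local=0, zone_reclaim_mode=0))  # (1, False)
print(pick_node({0: 10, 1: 500}, wanted=100, local=0, zone_reclaim_mode=1))  # (0, True)
```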
There is a knob in the kernel that determines how the situation is to be treated, /proc/sys/vm/zone_reclaim_mode. A value of 0 means that no local reclaim should take place. A value of 1 tells the kernel that a reclaim pass should be run in order to avoid allocations from the other node. On boot-up a mode is chosen based on the largest NUMA distance in the system.
If zone reclaim is switched on, the kernel still attempts to keep the reclaim pass as lightweight as possible. By default, reclaim will be restricted to unmapped page-cache pages. The frequency of reclaim passes can be further reduced by setting /proc/sys/vm/min_unmapped_ratio to the percentage of memory that must contain unmapped pages for the system to run a reclaim pass. The default is 1 percent.
Zone reclaim can be made more aggressive by enabling write-back of dirty pages or the swapping of anonymous pages, but in practice doing so has often resulted in significant performance issues.

Basic Numa Command-Line Tools

The main tool used to set up the NUMA execution environment for a process is numactl. Numactl can be used to display the system NUMA configuration, and to control shared memory segments. It is possible to restrict processes to a set of processors, as well as to a set of memory nodes. Numactl can be used, for example, to avoid task migration between nodes or restrict the memory allocation to a certain node. Note that additional reclaim passes may be required if the allocation is restricted. Those cases are not influenced by zone-reclaim mode because the allocation is restricted by a memory policy to a specific set of nodes, so the kernel cannot simply pick memory from another NUMA node.
Another tool that is frequently used for NUMA is taskset. It can only bind a task to a set of processors and therefore offers a subset of numactl's capabilities. Taskset is heavily used in non-NUMA environments, and its familiarity leads developers to prefer it over numactl on NUMA systems.

Numa Information

There are numerous ways to view information about the NUMA characteristics of the system and of various processes currently running. The hardware NUMA configuration of a system can be viewed by using numactl --hardware. This includes a dump of the SLIT (system locality information table) that shows the cost of accesses to different nodes in a NUMA system. The example below shows a NUMA system with two nodes. The distance for a local access is 10. A remote access costs twice as much on this system (20). This is the convention, but the practice of some vendors (especially for two-node systems) is simply to report 10 and 20 without regard to the actual latency differences to memory.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 588 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 169 MB
node distances:
node  0  1
  0: 10 20
  1: 20 10

Numastat is another tool that is used to show how many allocations were satisfied from the local node. Of particular interest is the numa_miss counter, which indicates that the system assigned memory from a different node in order to avoid reclaim. These allocations also contribute to the other_node counter; the remainder of that count is intentional off-node allocations. The amount of off-node memory can be used as a guide to figure out how effectively memory was assigned to processes running on the system.

$ numastat
                node0       node1
numa_hit        13273229839 4595119371
numa_miss       2104327350  6833844068
numa_foreign    6833844068  2104327350
interleave_hit  52991       52864
local_node      13273229554 4595091108
other_node      2104327635  6833872331
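Using the counters in the sample output above, the share of node 0's allocations that ended up node local can be computed directly (the figures below are copied from the listing):

```python
# Counters for node0, taken from the numastat sample output above.
node0 = {"local_node": 13273229554, "other_node": 2104327635}

total = node0["local_node"] + node0["other_node"]
local_share = node0["local_node"] / total
print(f"node0 allocations that were node local: {local_share:.1%}")
```

A share well below 100 percent suggests that many allocations spilled to the remote node, either deliberately (interleaving, restricted policies) or to avoid reclaim.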

How memory is allocated to a process can be seen via a status file in /proc/<pid>/numa_maps:

# cat /proc/1/numa_maps
7f830c175000 default anon=1 dirty=1 active=0 N1=1
7f830c177000 default file=/lib/x86_64-linux-gnu/ anon=1 dirty=1 active=0 N1=1
7f830c178000 default file=/lib/x86_64-linux-gnu/ anon=2 dirty=2 active=0 N1=2
7f830c17a000 default file=/sbin/init mapped=18 N1=18
7f830c39f000 default file=/sbin/init anon=2 dirty=2 active=0 N1=2
7f830c3a1000 default file=/sbin/init anon=1 dirty=1 active=0 N1=1
7f830dc56000 default heap anon=223 dirty=223 active=0 N0=52 N1=171
7fffb6395000 default stack anon=5 dirty=5 active=1 N1=5

The output shows the virtual address of each memory range and the policy in effect for it, followed by information about the NUMA characteristics of the range. Anon means that the pages do not have an associated file on disk. Nx shows the number of pages on the respective node.
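A minimal parser for lines in the numa_maps format shown above might look like the following. It is a sketch that handles only the field shapes visible in the sample output:

```python
def parse_numa_maps_line(line):
    """Split one /proc/<pid>/numa_maps line into address, policy,
    attributes, and per-node page counts (the Nx=... fields)."""
    fields = line.split()
    addr, policy, rest = fields[0], fields[1], fields[2:]
    pages = {}   # node id -> pages allocated on that node
    attrs = {}   # everything else (anon=, dirty=, file=, heap, stack, ...)
    for field in rest:
        if "=" in field:
            key, value = field.split("=", 1)
            if key.startswith("N") and key[1:].isdigit():
                pages[int(key[1:])] = int(value)
            else:
                attrs[key] = value
        else:
            attrs[field] = True  # bare flags such as "heap" or "stack"
    return addr, policy, attrs, pages

line = "7f830dc56000 default heap anon=223 dirty=223 active=0 N0=52 N1=171"
print(parse_numa_maps_line(line))
```

For the heap line from the sample, the parser reports 52 pages on node 0 and 171 on node 1, making the cross-node split immediately visible.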
The information about how memory is used in the system as a whole is available in /proc/meminfo. The same information is also available for each NUMA node in /sys/devices/system/node/node<X>/meminfo. Numerous other bits of information are available from the directory where meminfo is located. It is possible to compact memory, get distance tables, and manage huge pages and mlocked pages by inspecting and writing values to key files in that directory.

First-Touch Policy

Specifying memory policies for a process or address range does not cause any allocation of memory, which is often confusing to newcomers. Memory policies specify what should happen when the system needs to allocate memory for a virtual address. Pages in a process's memory space that have not been touched or that are zero do not have memory assigned to them. The processor will generate a hardware fault (a page fault) when a process touches or writes to an address that is not yet populated. During page-fault handling by the kernel, the page is allocated. The instruction that caused the fault is then restarted and will be able to access the memory as needed.
What matters, therefore, is the memory policy in effect when the allocation occurs. This is called the first touch. The first-touch policy refers to the fact that a page is allocated based on the effective policy when some process first uses a page in some fashion.
The effective memory policy on a page depends on memory policies assigned to a memory range or on a memory policy associated with a task. If a page is only in use by a single thread, then there is no ambiguity as to which policy will be followed. However, pages are often used by multiple threads. Any one of them may cause the page to be allocated. If the threads have different memory policies, then the page may as a result seem to be allocated in surprising ways for a process that also sees the same page later.
For example, it is fairly common that text segments are shared by all processes that use the same executable. The kernel will use the page from the text segment if it is already in memory regardless of the memory policy set on a range. The first user of a page in a text segment will therefore determine its location. Libraries are frequently shared among binaries, and especially the C library will be used by almost all processes on the system. Many of the most-used pages are therefore allocated during boot-up when the first binaries run that use the C library. The pages will at that point become established on a particular NUMA node and will stay there for the time that the system is running.
First-touch phenomena limit the placement control that a process has over its data. If the distance to a text segment has a significant impact on process performance, then dislocated pages will have to be moved in memory. Memory could appear to have been allocated on NUMA nodes not permitted by the memory policy of the current task because an earlier task has already brought the data into memory.
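First touch can be modeled as memoized allocation: whichever thread faults a page first fixes the page's node, and later touchers simply find it there. This is a toy model of the behavior described above, not kernel code:

```python
page_node = {}  # page number -> node the page was allocated on

def touch(page, node):
    """Fault-handler model: allocate on the toucher's node only on first touch."""
    return page_node.setdefault(page, node)

# A thread on node 0 touches the page first; a later thread on node 1
# finds the page wherever the first toucher placed it.
print(touch(7, node=0))  # 0: first touch allocates on node 0
print(touch(7, node=1))  # 0: the page stays on node 0
```

This is why shared text pages, such as those of the C library, end up pinned to whichever node first ran a binary using them.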

Moving Memory

Linux has the capability to move memory. The virtual address of the memory in the process space stays the same. Only the physical location of the data is moved to a different node. The effect can be observed by looking at /proc/<pid>/numa_maps before and after a move.
Migrating all of a process's memory to a node can optimize application performance by avoiding cross-connect accesses if the system has placed pages on other NUMA nodes. However, a regular user can move only pages of a process that are referenced only by that process (otherwise, the user could interfere with performance optimization of processes owned by other users). Only root has the capability to move all pages of a process.
It can be difficult to ensure that all pages are local to a process since some text segments are heavily shared and there can be only one page backing an address of a text segment. This is particularly an issue with the C library and other heavily shared libraries.
Linux has a migratepages command-line tool to manually move pages around by specifying a pid and the source and destination nodes. The memory of the process will be scanned for pages currently allocated on the source node. They will be moved to the destination node.

Numa Scheduling

The Linux scheduler had no notion of the page placement of memory in a process until Linux 3.8. Decisions about migrating processes were made based on an estimate of the cache hotness of a process's memory. If the Linux scheduler moved the execution of a process to a different NUMA node, the performance of that process could be harmed because its memory now needed access via the cross-connect. Once that move was complete the scheduler would estimate that the process memory was cache hot on the remote node and leave the process there as long as possible. As a result, administrators who wanted the best performance felt it best not to let the Linux scheduler interfere with memory placement. Processes were often pinned to a specific set of processors using taskset, or the system was partitioned using the cpusets feature to keep applications within the NUMA node boundaries.
In Linux 3.8 the first steps were taken to address this situation by merging a framework that will eventually enable the scheduler to consider the page placement and perhaps automatically migrate pages from remote nodes to the local node. However, a significant development effort is still needed, and the existing approaches do not always enhance load performance. This was the state of affairs in April 2013, when this section was written. More recent information may be found on the Linux kernel mailing list or in articles on Linux Weekly News.

Conclusion

NUMA support has been around for a while in various operating systems. NUMA support in Linux has been available since early 2000 and is continually being refined. Kernel NUMA support frequently optimizes process execution without the need for user intervention, and in most use cases an operating system can simply be run on a NUMA system, providing decent performance for typical applications.
Special NUMA configuration through tools and kernel configuration comes into play when the heuristics provided by the operating system do not provide satisfactory application performance to the end user. This is typically the case in high-performance computing, high-frequency trading, and real-time applications, but these issues have recently become more significant for regular enterprise-class applications. Traditionally, NUMA support required special knowledge about the application and hardware for proper tuning using the knobs provided by the operating systems. Recent developments (especially around the Linux NUMA scheduler) will likely enable operating systems to automatically balance a NUMA application load properly over time.
The use of NUMA needs to be guided by the increase in performance that is possible. The larger the difference between local and remote memory access, the greater the benefits that arise from NUMA placement. NUMA latency differences are due to memory accesses. If the application does not rely on frequent memory accesses (because, for example, the processor caches absorb most of the memory operations), NUMA optimizations will have no effect. Also, for I/O-bound applications the bottleneck is typically the device and not memory access. An understanding of the characteristics of the hardware and software is required in order to optimize applications using NUMA.