Dissecting Python List allocator (CPython)

List is one of the most frequently used data structure in Python applications. The code for Python list object allocator is present in Python-2.7.5/Objects/listobject.c.

Python allocates list objects in heap and uses underlying malloc based APIs. List is a container of Python object references. In plain C terminology. it is a dynamic array of pointers.

Let’s understand basic creation and insertion APIs.

  • list_resize(): A list is of type PyListObject and is equivalent to PyObject. Every resize request allocates more than asked size.

Allocation unit size is : (size_requested + (size_requested << 3 + (size_requested < 9 : 3 or 6) ).

It gives a growth pattern of (0, 4, 8, 16, 25, 35, 46, 58, 72, 88).

  • free_list: Python keeps PyList_MAXFREELIST (defined to value 80) of free lists to save on lists allocation and deallocation.
  • PyList_New(): A new list allocation requests is served from either a free list or a new GC tracked object creation of type PyListObject. Next, a malloc request is made for requested list size.

Size calculated: nbytes = size * sizeof(PyObject *);

Object allocation: PyListObject *op = free_list[numfree] OR PyObject_GC_New(PyListObject, &PyList_Type);

Memory allocation for references: op->ob_item = (PyObject **) PyMem_MALLOC(nbytes);

Advertisements

How to limit heap size of a Python process on Linux?

Python allows customization to resource allocation for a Python VM instance.

http://docs.python.org/2/library/resource.html

import resource
resource.setrlimit(resource.RLIMIT_AS, (megs * 1048576L, -1L))

Openstack swift: EADDRNOTAVAIL error

Problem

swift-bench keeps EADDRNOTAVAIL error with a highly concurrency setting and multiple swift-bench clients.

Setup

Ubuntu 13, Swift single machine installation (refer SAIO), swift-client runs local with no-proxy mode.

Solution

  • EADDRNOTAVAIL stands for either unavailability of ephemeral ports and a known kernel bug.
  • Check your range of ports: $cat /proc/sys/net/ipv4/ip_local_port_range
  • swift-bench in no-proxy mode uses direct client class based on Python’s HTTPLib. I saw that code for object write and read did not have connection close call. So, I added that. Please refer swift/common/direct_client.py.
  • The kernel bug is based on usage of bind() call during a connection setup from client. swift-bench so not use bind. So this possibility is ruled out.
  • Swift administration guide advises of following setting:
The following settings should be in /etc/sysctl.conf:
# disable TIME_WAIT.. wait..
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_tw_reuse=1# disable syn cookies
net.ipv4.tcp_syncookies = 0
To load the updated sysctl settings, run $sudo sysctl -p

The above mentioned solutions reduced the problem significantly. If there is a better solution, let me know.

Best Python memcache client: Python-memcached v/s Pylibmc

Recently, I tried to access memcached server on local and remote setup in a Python application. I am sharing my experience with two of most popular clients.

Python-memcached

  • The first choice was a pure¬† Python solution to access memcached, that is python-memcached. It is a simple to install, understand and use solution.
  • It do not offer many customization setting, at least on its README ūüôā
  • Very less or non-existent, useful documentation
  • Fails consistently for a high frequency request to memcached server. Failures are for simultaneous read on a file xxx.¬† I could not find an easy fix for this problem.
  • I do not suggest using it for a highly scalable application.

Pylibmc

  • Bare minimum Python wrapper to C/C++ API to memcached
  • Installation was simple similar to Python-memcached
  • Offers many useful options during connection setup (such as non-blocking requests, TCP no-delay disable)
  • Shows no problem in demanding and highly scalable application
  • Shows better performance than Python-memcached

My verdict: Pylibmc is a clear winner ūüôā

Python Internals: Understanding Python data model (I)

Python sees everything as object. Every object has an identity, value and a type. Object identity and type are invariable.

Python Data Model
Python Data Model

Object type determines if value is mutable or otherwise.

Lifetime of object is based on reference count mechanism.

Object Container

Object containers are: list, dictionary, tuple, set. Containers keep reference (object identity) to objects. Mutability of container is based on references. The value of the referred object could be mutable.

Let’s see how they behave:

from sys import getrefcount

a = 1
b = 1

list1 = [] 
list2 = [] 

t1 = (a, b)
t2 = (a, b)

# a and b share reference to same object ID 
print "a=", id(a)
print "b=", id(b)

# Constant 1 has ref count of +2 (a and b)
print "getrefcount(1)=", getrefcount(1)

# Constant 1 has ref count of +3 now (a,b and c)
c = 1
print "getrefcount(1)=", getrefcount(1)

# Decrement the object ref count
del c
print "getrefcount(1)=", getrefcount(1)

# Default ref count of an unused new integer object is 3. But, why?
print "getrefcount(999999)=", getrefcount(999999)

print ""

# Mutable objects like list do not refer to same object ID, even
# though value of objects are same!
print "list1=", id(list1)
print "list2=", id(list2)
print "getrefcount(list1)=", getrefcount(list1)
print "getrefcount(list2)=", getrefcount(list2)

print ""

print "t1=", id(t1)
print "t2=" ,id(t2)

# Containers have default ref count of 2
print "getrefcount(t1)=", getrefcount(t1)
print "getrefcount(t2)=", getrefcount(t2)

# Changing contained object values do not modiy immutable container
a = 3
b = 10

print "t1=", id(t1)
print "t2=" ,id(t2)

# String literals are constant and are referred to like numbers
s1 = "hello"
s2 = "hello"

# s1 and s2 refer to same object
print "s1=", id(s1)
print "s2=" ,id(s2)
print "getrefcount(hello)=", getrefcount("hello")

# The first use of a literal uses 3 ref count
print "getrefcount(hello!!)=", getrefcount("hello!!")

s3 = "hello!"
print "getrefcount(hello!)=", getrefcount("hello!")

Parenthesize an expression in Python

    def pref(op):
        print "called with op", op
        ret = -1
        if op == '+':
            print "matched +"
            ret = 1
        if op == '-':
            print "matched -"
            ret = 2
        if op == '*':
            print "matched *"
            ret = 3
        if op == '/':
            print "matched /"
            ret = 4
    
        return ret
    
    def evaluate(expr, operand_stack, operator_stack):
        print "**In evaluate**"
        print operator_stack
        print operand_stack
    
        expr1 = operand_stack.pop()
        expr2 = operand_stack.pop()
        op    = operator_stack.pop()
    
        # Parenthesize the expression
        expr = "(" + expr2 + op + expr1 + ")"
        print "expr1", expr1
        print "expr2", expr2
        print "expr", expr
    
        # Push the result back on the stack
        operand_stack.append(expr)
    
        print operator_stack
        print operand_stack
        print "**Out evaluate**"
        return expr
    
    def looper(str, expr, operator_stack, operand_stack):
        l = 0
        cnt = len(str)
    
        # Loop over the input string
        while  l < cnt:
            if str[l] in ('+', '-', '*', '/'):
                print "operator found: op, index", str[l], l
                print operator_stack, len(operator_stack)
    
                x = len(operator_stack) - 1
                if x > 0:
                    print "Comparing:", operator_stack[x], str[l]
    
                    # If op on stack has higher preference than the op in question
                    if (pref(operator_stack[x]) > pref(str[l])):
                        expr = evaluate(expr, operand_stack, operator_stack)
                operator_stack.append(str[l])
            else:
                # Add the operand to operand stack
                operand_stack.append(str[l])
            l += 1
    
        print operator_stack
        print operand_stack
    
        print "Take care of last elements"
        op_cnt = len(operator_stack)
        while op_cnt:
            expr = evaluate(expr, operand_stack, operator_stack)
            op_cnt -= 1
    
        print operator_stack
        print operand_stack
    
    if __name__ == '__main__':
        str = "a+c*d-e/w*x+a-s"
        cnt = len(str)
    
        operand_stack  = []
        operator_stack  = []
        expr = ""
        looper(str, expr, operator_stack, operand_stack)
    
        print "Output=>", operand_stack[0]

Pylucene- Part III: Hightlighting in Search

The search result can be customized to highlight the phrases that contain the requested keyword. The following code uses “Highlighter” class from Pylucene. We emit result in HTML formatted syntax.

from lucene import \
            QueryParser, IndexSearcher, IndexReader, StandardAnalyzer, \
        TermPositionVector, SimpleFSDirectory, File, SimpleSpanFragmenter, Highlighter, \
    QueryScorer, StringReader, SimpleHTMLFormatter, \
            VERSION, initVM, Version
import sys

FIELD_CONTENTS = "contents"
FIELD_PATH = "path"
#QUERY_STRING = "lucene and restored"
QUERY_STRING = sys.argv[1]
STORE_DIR = "/home/kanaujia/lucene_index"

if __name__ == '__main__':
    initVM()
    print 'lucene', VERSION

    # Get handle to index directory
    directory = SimpleFSDirectory(File(STORE_DIR))

    # Creates a searcher searching the provided index.
    ireader  = IndexReader.open(directory, True)

    # Implements search over a single IndexReader.
    # Use a single instance and use it across queries
    # to improve performance.
    searcher = IndexSearcher(ireader)

    # Get the analyzer
    analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)

    # Constructs a query parser.
    queryParser = QueryParser(Version.LUCENE_CURRENT, FIELD_CONTENTS, analyzer)

    # Create a query
    query = queryParser.parse(QUERY_STRING)

    topDocs = searcher.search(query, 50)

    # Get top hits
    scoreDocs = topDocs.scoreDocs
    print "%s total matching documents." % len(scoreDocs)

    HighlightFormatter = SimpleHTMLFormatter();
    query_score = QueryScorer (query)

    highlighter = Highlighter(HighlightFormatter, query_score)

    # Set the fragment size. We break text in to fragment of 64 characters
    fragmenter  = SimpleSpanFragmenter(query_score, 64);
    highlighter.setTextFragmenter(fragmenter); 

    for scoreDoc in scoreDocs:
        doc = searcher.doc(scoreDoc.doc)
    text = doc.get(FIELD_CONTENTS)
    ts = analyzer.tokenStream(FIELD_CONTENTS, StringReader(text))
        print doc.get(FIELD_PATH)
        print highlighter.getBestFragments(ts, text, 3, "...")
    print ""

The code is an extension of search code discussed in Part-II.

We create a HTML formatter with SimpleHTMLFormatter. We create a QueryScorer to iterate over resulting documents in non-decreasing doc ID.

HighlightFormatter = SimpleHTMLFormatter();
query_score = QueryScorer (query)

highlighter = Highlighter(HighlightFormatter, query_score)

We break the text content into 64 bytes character set.
fragmenter  = SimpleSpanFragmenter(query_score, 64);
highlighter.setTextFragmenter(fragmenter);

for scoreDoc in scoreDocs:
doc = searcher.doc(scoreDoc.doc)
text = doc.get(FIELD_CONTENTS)
ts = analyzer.tokenStream(FIELD_CONTENTS, StringReader(text))
print doc.get(FIELD_PATH)

Now we set number of lines for phrases in a document.

print highlighter.getBestFragments(ts, text, 3, “…”)

Results

kanaujia@ubuntu:~/work/Py/pylucy2/pylucy$ python searcher_highlight.py hello
lucene 3.6.1
50 total matching documents.
/home/kanaujia/Dropbox/PyConIndia/fsMgr/root/hello
hi hello

/home/kanaujia/Dropbox/PyConIndia/fsMgr.v4/root/hello
hi hello

/home/kanaujia/Dropbox/.dropbox.cache/2012-09-27/hello (deleted 505bda23-9-e8756a51)
hi hello

/home/kanaujia/Dropbox/PyConIndia/fsMgr.v1/root/hello.html

Hello htmls

{% module Hello() %}