Notes on RGW System Object State

The RGW raw object state (RGWRawObjState) has the following structure:

// rgw/rgw_rados.h
struct RGWRawObjState {
  rgw_raw_obj obj;
  bool has_attrs{false};
  bool exists{false};
  uint64_t size{0};
  ceph::real_time mtime;
  uint64_t epoch;
  bufferlist obj_tag;
  bool has_data{false};
  bufferlist data;
  bool prefetch_data{false};
  uint64_t pg_ver{0};

  /* important! don't forget to update copy constructor */

  RGWObjVersionTracker objv_tracker;

  map<string, bufferlist> attrset;
  RGWRawObjState() {}
  /* copy constructor omitted here */
};


Notes on RGW Request Path

The principal class is RGWOp; it holds the request state and an RGWRados store pointer.
The RGW request struct req_state carries:

  • Ceph context
  • op type info
  • account, bucket info
  • zonegroup name
  • RGWBucketInfo bucket_info
  • RGWUserInfo *user
Op Execution

RGWGetObj::execute() is the primary execution path for a GET request. It uses the interfaces of RGWRados::Object to perform I/O. The read op carries information such as the zone ID, PG version, mod_ptr, and object size.
Next, RGWRados::Object::Read::prepare() is called.
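
The flow looks roughly like the sketch below. This is a paraphrase of the Luminous-era GET path, not the literal Ceph source; the exact members of conds/params and the iterate() signature vary a little between releases.

// Sketch of the RGWGetObj execution path (simplified; in-tree headers assumed).
#include "rgw/rgw_rados.h"

int sketch_get_obj(req_state *s, RGWRados *store, RGWObjectCtx& obj_ctx)
{
  rgw_obj obj(s->bucket, s->object);

  // Build an object target from the request state, then a Read op on it.
  RGWRados::Object op_target(store, s->bucket_info, obj_ctx, obj);
  RGWRados::Object::Read read_op(&op_target);

  // The read op carries the conditional-GET inputs and the outputs we want back.
  read_op.conds.mod_ptr = nullptr;          // If-Modified-Since, unused here
  read_op.conds.if_match = nullptr;         // If-Match ETag, unused here
  uint64_t obj_size = 0;
  std::map<std::string, bufferlist> attrs;
  read_op.params.attrs = &attrs;
  read_op.params.obj_size = &obj_size;

  int r = read_op.prepare();                // stat the head object, load attrs/manifest
  if (r < 0)
    return r;

  // Stream the requested range; the callback (e.g. RGWGetObj_CB) sends data to the client.
  int64_t end = obj_size ? obj_size - 1 : 0;
  return read_op.iterate(0, end, nullptr /* data callback */);
}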


Notes on RGW Manifest

RGW maintains a manifest for each object. The class RGWObjManifest implements the details of head and tail object placement.
The manifest is written as an xattr along with the rest of the object metadata in RGWRados::Object::Write::_do_write_meta(). The in-tree comment for the write path reads:

/**
 * Write/overwrite an object to the bucket storage.
 * bucket: the bucket to store the object in
 * obj: the object name/key
 * data: the object contents/value
 * size: the amount of data to write (data must be this long)
 * accounted_size: original size of data before compression, encryption
 * mtime: if non-NULL, writes the given mtime to the bucket storage
 * attrs: all the given attrs are written to bucket storage for the given object
 * exclusive: create object exclusively
 * Returns: 0 on success, -ERR# otherwise.
 */
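
Concretely, the manifest rides along in that attrs map. Below is a minimal sketch of how it gets attached; RGW_ATTR_MANIFEST and RGWObjManifest::encode() are real in-tree names, while the helper function itself is purely illustrative:

// Sketch: encode the manifest and stash it in the xattr map that
// _do_write_meta() persists together with the head object.
#include "rgw/rgw_rados.h"     // RGWObjManifest; RGW_ATTR_MANIFEST comes via rgw_common.h

void attach_manifest(RGWObjManifest& manifest,
                     std::map<std::string, bufferlist>& attrs)
{
  bufferlist bl;
  manifest.encode(bl);               // serialize the manifest
  attrs[RGW_ATTR_MANIFEST] = bl;     // stored as an xattr of the head object
}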


Notes on Ceph librados Client

Cluster Connection

  • A client is an application that uses librados to connect to a Ceph cluster.

  • It needs a cluster object populated with cluster info (the cluster name and settings from ceph.conf).

  • Then the client does a rados_connect and the cluster handle is populated.

  • A cluster handle can bind with different pools.

Cluster IO context

  • The I/O happens on a pool so the connection needs to bind to a pool.

  • The connection to a pool gives the client an I/O context.

  • The client only specifies an object name/xattr and librados maps it to a PG and OSD in the cluster.

  • An object write to RADOS requires a key, a value, and the value size.

  • librados::bufferlist is primarily used for storing object value.
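
Putting the two lists together, a minimal librados client in C++ looks roughly like this; the pool name, object name, and ceph.conf path are placeholders:

#include <rados/librados.hpp>
#include <iostream>

int main()
{
  librados::Rados cluster;

  // Cluster connection: create the cluster object, read ceph.conf, connect.
  int ret = cluster.init("admin");                        // connect as client.admin
  if (ret == 0) ret = cluster.conf_read_file("/etc/ceph/ceph.conf");
  if (ret == 0) ret = cluster.connect();                  // populates the cluster handle
  if (ret < 0) {
    std::cerr << "cluster connect failed: " << ret << std::endl;
    return 1;
  }

  // I/O context: bind the connection to a pool.
  librados::IoCtx ioctx;
  ret = cluster.ioctx_create("mypool", ioctx);            // "mypool" is a placeholder
  if (ret < 0) {
    std::cerr << "ioctx_create failed: " << ret << std::endl;
    cluster.shutdown();
    return 1;
  }

  // Object write: name + value; librados maps the name to a PG and OSD.
  librados::bufferlist bl;
  bl.append("hello rados");
  ret = ioctx.write_full("my-object", bl);

  ioctx.close();
  cluster.shutdown();
  return ret < 0 ? 1 : 0;
}

Build it against librados, for example: g++ client.cc -o client -lrados.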


Ceph Outage with OSD Heartbeat Failures on Hammer (0.94.6)

Symptoms

  • The cluster went down after 24 OSDs were added and marked in simultaneously.
  • This was an erasure coded (10+5) RGW cluster on Hammer.
  • All the OSDs started failing and eventually 50% of the OSDs were down.
  • Manual efforts to bring them up failed, and we saw heartbeat failures in the OSD logs (examples below).
  • Every OSD was consuming ~15 GB of RAM and OSDs were hitting out-of-memory errors.
2018-07-18 08:58:12.794311 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.55 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)

2018-07-18 08:58:12.794315 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.57 since back 2018-07-18 08:45:42.452510 front 2018-07-18 08:45:42.452510 (cutoff 2018-07-18 08:57:12.794247)

2018-07-18 08:58:12.794321 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.82 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
  • OSD maps were out of sync:
2018-07-18 08:56:52.668789 7f4886d7b700  0 -- 10.33.49.153:6816/505502 >> 10.33.213.157:6801/2707 pipe(0x7f4a4f39d000 sd=26 :13251 s=1 pgs=233 cs=2 l=0 c=0x7f4a4f1b8980).connect claims to be 10.33.213.157:6801/1003787 not 10.33.213.157:6801/2707 - wrong node!
  • Each OSD had ~3000 threads, most of them sleeping.
  • Using GDB to get a backtrace of all threads, we found that most of the active threads were just SimpleMessenger Pipe readers.
  • We initially suspected a memory leak in the Ceph code.

Band-aid Fixes

  • Set norebalance, norecover, nobackfill (the corresponding commands are sketched at the end of this section)

  • Adding swap memory on the OSD nodes

  • Tuning the heartbeat interval

  • Tuning OSD map sync and setting noout, nodown to let OSDs sync their maps.

$ sudo ceph daemon osd.148 status
{
    "cluster_fsid": "621d76ce-a208-42d6-a15b-154fcb09xcrt",
    "osd_fsid": "09650e4c-723e-45e0-b2ef-5b6d11a6da03",
    "whoami": 148,
    "state": "booting",
    "oldest_map": 156518,
    "newest_map": 221059,
    "num_pgs": 1295
}
  • Tuning OSD map cache size to 20
  • Finding and killing non-Ceph processes that were consuming network, CPU, and RAM
  • Starting OSDs one by one, which is what finally worked for us 🙂
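
For reference, the flags and tunables listed above map to commands like the following; the heartbeat grace value is illustrative, not necessarily what we settled on:

# pause data movement while the cluster is unstable
ceph osd set norebalance
ceph osd set norecover
ceph osd set nobackfill

# keep OSDs from being marked out/down while they catch up on maps
ceph osd set noout
ceph osd set nodown

# relax the heartbeat grace period at runtime (value illustrative)
ceph tell osd.* injectargs '--osd-heartbeat-grace 60'

# in ceph.conf, [osd] section: shrink the map cache (takes effect on restart)
#   osd map cache size = 20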

RCA

  • The major culprit was a rogue process that was consuming massive network bandwidth on OSD nodes.
  • With network bandwidth exhausted, many messenger threads were simply waiting.
  • SimpleMessenger threads are synchronous and would wait until their I/O got through.
  • That is one of the reasons an OSD ended up with ~3000 threads and ~15 GB of memory.
  • Because the network was saturated, OSD heartbeat messages were also blocked, and OSDs were either hitting their suicide timeout or dying of OOM.


Ceph Luminous build

The per-request object context in the Luminous tree (rgw/rgw_rados.h):

struct RGWObjectCtx {
  RGWRados *store;
  void *user_ctx;

  RGWObjectCtxImpl<rgw_obj, RGWObjState> obj;
  RGWObjectCtxImpl<rgw_raw_obj, RGWRawObjState> raw;

  explicit RGWObjectCtx(RGWRados *_store) : store(_store), user_ctx(NULL), obj(store), raw(store) { }
  RGWObjectCtx(RGWRados *_store, void *_user_ctx) : store(_store), user_ctx(_user_ctx), obj(store), raw(store) { }
};


Building Ceph on Debian Jessie

  1. Check out Ceph with --recursive.
  2. Comment out the code that installs setuptools, then run sudo ./install-deps.sh.
  3. Run ./ceph/do_cmake.sh.
  4. Switch to root using sudo su and run the following commands:
    echo "LC_ALL=en_US.UTF-8" >> /etc/environment
    echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen
    echo "LANG=en_US.UTF-8" > /etc/locale.conf
    locale-gen en_US.UTF-8
  5. Switch back to your login and create ~/.bash_profile:
    $ cat ~/.bash_profile
    export LC_ALL=en_US.UTF-8
    export LANG=en_US.UTF-8
  6. Run source ~/.bash_profile.
  7. cd to ./ceph/build and run make -j 4.

s3cmd SSL connection error

Problem Statement

 File "/usr/lib/python2.7/httplib.py", line 1263, in connect
    server_hostname=server_hostname)
  File "/usr/lib/python2.7/ssl.py", line 363, in wrap_socket
    _context=self)
  File "/usr/lib/python2.7/ssl.py", line 611, in __init__
    self.do_handshake()
  File "/usr/lib/python2.7/ssl.py", line 840, in do_handshake
    self._sslobj.do_handshake()
error: [Errno 0] Error

Environment

  • Debian 9
  • s3cmd 2.0.1

Solution

The failure occurs during the SSL handshake. As a workaround, invoke s3cmd without SSL (--no-ssl), as follows:

s3cmd --no-ssl get \
  --access_key=5DTA7J1ORIQ3E7LMV9YD \
  --secret_key=GIqPAez7zdHSC9r3HsMNOgJlHqHttvGi \
  --host=10.xx.xx.xxx:80 --host-bucket=10.xx.xx.xx:80 \
  s3://TESTBUCKET/abc.tgz
