]> git.kernelconcepts.de Git - karo-tx-linux.git/log
karo-tx-linux.git
10 years agoBtrfs: check for actual acls rather than just xattrs when caching no acl
Josef Bacik [Wed, 19 Jun 2013 14:16:26 +0000 (10:16 -0400)]
Btrfs: check for actual acls rather than just xattrs when caching no acl

We have an optimization that will go ahead and cache no acls on an inode if
there are no xattrs on the inode.  This saves us a lookup later to check the
acls for writes or any other access.  The problem is I use selinux so I always
have an xattr on inodes, so make this test a little smarter and check for the
actual acl hash on the key and if it isn't there then we still get to cache no
acl which makes everybody who uses selinux a little happier.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
Josef Bacik [Mon, 17 Jun 2013 21:14:39 +0000 (17:14 -0400)]
Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate

This has plagued us forever and I'm so over working around it.  When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point.  The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things.  This also tends to bite the orphan
cleanup stuff too which keeps people from mounting.  To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data.  This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: optimize reada_for_balance
Josef Bacik [Mon, 17 Jun 2013 18:23:02 +0000 (14:23 -0400)]
Btrfs: optimize reada_for_balance

This patch does two things.  First we no longer explicitly read in the blocks
we're trying to readahead.  For things like balance_level we may never actually
use the blocks so this just adds uneeded latency, and balance_level and
split_node will both read in the blocks they care about explicitly so if the
blocks need to be waited on it will be done there.  Secondly we no longer drop
the path if we do readahead, we just set the path blocking before we call
reada_for_balance() and then we're good to go.  Hopefully this will cut down on
the number of re-searches.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: optimize read_block_for_search
Josef Bacik [Mon, 17 Jun 2013 17:44:48 +0000 (13:44 -0400)]
Btrfs: optimize read_block_for_search

This patch does two things, first it only does one call to
btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
again with gen specified.  The other thing is to call btrfs_read_buffer() on the
buffer we've found instead of dropping it and then calling read_tree_block().
This will keep us from doing yet another radix tree lookup for a buffer we've
already found.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: unlock extent range on enospc in compressed submit
Josef Bacik [Fri, 14 Jun 2013 20:58:23 +0000 (16:58 -0400)]
Btrfs: unlock extent range on enospc in compressed submit

A user reported a deadlock where the async submit thread was blocked on the
lock_extent() lock, and then everybody behind him was locked on the page lock
for the page he was holding.  Looking at the code I noticed we do not unlock the
extent range when we get ENOSPC and goto retry.  This is bad because we
immediately try to lock that range again to do the cow, which will cause a
deadlock.  Fix this by unlocking the range.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix the comment typo for btrfs_attach_transaction_barrier
Wang Sheng-Hui [Fri, 14 Jun 2013 08:21:24 +0000 (16:21 +0800)]
Btrfs: fix the comment typo for btrfs_attach_transaction_barrier

The comment is for btrfs_attach_transaction_barrier, not for
btrfs_attach_transaction. Fix the typo.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix not being able to find skinny extents during relocate
Josef Bacik [Thu, 13 Jun 2013 17:50:23 +0000 (13:50 -0400)]
Btrfs: fix not being able to find skinny extents during relocate

We unconditionally search for the EXTENT_ITEM_KEY for metadata during balance,
and then check the key that we found to see if it is actually a
METADATA_ITEM_KEY, but this doesn't work right because METADATA is a higher key
value, so if what we are looking for happens to be the first item in the leaf
the search will dump us out at the previous leaf, and we won't find our item.
So instead do what we do everywhere else, search for the skinny extent first and
if we don't find it go back and re-search for the extent item.  This patch fixes
the panic I was hitting when balancing a large file system with skinny extents.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup backref search commit root flag stuff
Josef Bacik [Wed, 12 Jun 2013 20:20:08 +0000 (16:20 -0400)]
Btrfs: cleanup backref search commit root flag stuff

Looking into this backref problem I noticed we're using a macro to what turns
out to essentially be a NULL check to see if we need to search the commit root.
I'm killing this, let's just do what everybody else does and checks if trans ==
NULL.  I've also made it so we pass in the path to __resolve_indirect_refs which
will have the search_commit_root flag set properly already and that way we can
avoid allocating another path when we have a perfectly good one to use.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: free csums when we're done scrubbing an extent
Josef Bacik [Mon, 10 Jun 2013 12:59:04 +0000 (12:59 +0000)]
Btrfs: free csums when we're done scrubbing an extent

A user reported scrub taking up an unreasonable amount of ram as it ran.  This
is because we lookup the csums for the extent we're scrubbing but don't free it
up until after we're done with the scrub, which means we can take up a whole lot
of ram.  This patch fixes this by dropping the csums once we're done with the
extent we've scrubbed.  The user reported this to fix their problem.  Thanks,

Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix transaction throttling for delayed refs
Josef Bacik [Wed, 12 Jun 2013 17:56:06 +0000 (13:56 -0400)]
Btrfs: fix transaction throttling for delayed refs

Dave has this fs_mark script that can make btrfs abort with sufficient amount of
ram.  This is because with more ram we can keep more dirty metadata in cache
which in a round about way makes for many more pending delayed refs.  What
happens is we end up not throttling the transaction enough so when we go to
commit the transaction when we've completely filled the file system we'll
abort() because we use all of the space in the global reserve and we still have
delayed refs to run.  To fix this we need to make the delayed ref flushing and
the transaction throttling dependant upon the number of delayed refs that we
have instead of how much reserved space is left in the global reserve.  With
this patch we not only stop aborting transactions but we also get a smoother run
speed with fs_mark and it makes us about 10% faster.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: stop waiting on current trans if we aborted
Josef Bacik [Mon, 10 Jun 2013 20:47:23 +0000 (16:47 -0400)]
Btrfs: stop waiting on current trans if we aborted

I hit a hang when run_delayed_refs returned an error in the beginning of
btrfs_commit_transaction.  If we decide we need to commit the transaction in
btrfs_end_transaction we'll set BLOCKED and start to commit, but if we get an
error this early on we'll just exit without committing.  This is fine, except
that anybody else who tried to start a transaction will sit in
wait_current_trans() since we're set to BLOCKED and we never set it to something
else and woke people up.  To fix this we want to check for trans->aborted
everywhere we wait for the transaction state to change, and make
btrfs_abort_transaction() wake up any waiters there may be.  All the callers
will notice that the transaction has aborted and exit out properly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: wake up delayed ref flushing waiters on abort
Josef Bacik [Mon, 10 Jun 2013 15:52:32 +0000 (11:52 -0400)]
Btrfs: wake up delayed ref flushing waiters on abort

I hit a deadlock because we aborted when flushing delayed refs but didn't wake
any of the other flushers up and so everybody was just sleeping forever.  This
should fix the problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: fix the code comments for LZO compression workspace
Jie Liu [Thu, 6 Jun 2013 13:38:50 +0000 (13:38 +0000)]
btrfs: fix the code comments for LZO compression workspace

Fix the code comments for lzo compression workspace.
The buf item is used to store the decompressed data
and cbuf is used to store the compressed data.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix broken nocow after balance
Miao Xie [Thu, 6 Jun 2013 03:28:03 +0000 (03:28 +0000)]
Btrfs: fix broken nocow after balance

Balance will create reloc_root for each fs root, and it's going to
record last_snapshot to filter shared blocks.  The side effect of
setting last_snapshot is to break nocow attributes of files.

Since the extents are not shared by the relocation tree after the balance,
we can recover the old last_snapshot safely if no one snapshoted the
source tree. We fix the above problem by this way.

Reported-by: Kyle Gates <kylegates@hotmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: exclude logged extents before replying when we are mixed
Josef Bacik [Thu, 6 Jun 2013 17:19:32 +0000 (13:19 -0400)]
Btrfs: exclude logged extents before replying when we are mixed

With non-mixed block groups we replay the logs before we're allowed to do any
writes, so we get away with not pinning/removing the data extents until right
when we replay them.  However with mixed block groups we allocate out of the
same pool, so we could easily allocate a metadata block that was logged in our
tree log.  To deal with this we just need to notice that we have mixed block
groups and do the normal excluding/removal dance during the pin stage of the log
replay and that way we don't allocate metadata blocks from areas we have logged
data extents.  With this patch we now pass xfstests generic/311 with mixed
block groups turned on.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: put our inode if orphan cleanup fails
Josef Bacik [Tue, 4 Jun 2013 01:39:49 +0000 (21:39 -0400)]
Btrfs: put our inode if orphan cleanup fails

When we cross into a different subvol when doing a lookup we will run the orhpan
cleanup.  If this fails however we do not drop the ref to the inode we were
looking up before we return an error, which leads to busy inodes on umount.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: add some missing iput()'s in btrfs_orphan_cleanup
Josef Bacik [Mon, 3 Jun 2013 20:51:23 +0000 (16:51 -0400)]
Btrfs: add some missing iput()'s in btrfs_orphan_cleanup

There are some error cases that we don't do an iput() on our inode, fix this.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: do not pin while under spin lock
Josef Bacik [Mon, 3 Jun 2013 20:42:36 +0000 (16:42 -0400)]
Btrfs: do not pin while under spin lock

When testing a corrupted fs I noticed I was getting sleep while atomic errors
when the transaction aborted.  This is because btrfs_pin_extent may need to
allocate memory and we are calling this under the spin lock.  Fix this by moving
it out and doing the pin after dropping the spin lock but before dropping the
mutex, the same way it works when delayed refs run normally.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: Cocci spatch "memdup.spatch"
Thomas Meyer [Sat, 1 Jun 2013 09:37:50 +0000 (09:37 +0000)]
Btrfs: Cocci spatch "memdup.spatch"

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: Cocci spatch "ptr_ret.spatch"
Thomas Meyer [Sat, 1 Jun 2013 09:45:35 +0000 (09:45 +0000)]
Btrfs: Cocci spatch "ptr_ret.spatch"

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix qgroup rescan resume on mount
Jan Schmidt [Tue, 28 May 2013 15:47:24 +0000 (15:47 +0000)]
Btrfs: fix qgroup rescan resume on mount

When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restuctures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.

First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronizations problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.

Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable

Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]

qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.

We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.

As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: avoid double free of fs_info->qgroup_ulist
Jan Schmidt [Tue, 28 May 2013 15:47:23 +0000 (15:47 +0000)]
Btrfs: avoid double free of fs_info->qgroup_ulist

When btrfs_read_qgroup_config or btrfs_quota_enable return non-zero, we've
already freed the fs_info->qgroup_ulist. The final btrfs_free_qgroup_config
called from quota_disable makes another ulist_free(fs_info->qgroup_ulist)
call.

We set fs_info->qgroup_ulist to NULL on the mentioned error paths, turning
the ulist_free in btrfs_free_qgroup_config into a noop.

Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix memory patcher through fs_info->qgroup_ulist
Jan Schmidt [Tue, 28 May 2013 15:47:22 +0000 (15:47 +0000)]
Btrfs: fix memory patcher through fs_info->qgroup_ulist

Commit 5b7c665e introduced fs_info->qgroup_ulist, that is allocated during
btrfs_read_qgroup_config and meant to be used later by the qgroup accounting
code. However, it is always freed before btrfs_read_qgroup_config returns,
becuase the commit mentioned above adds a check for (ret), where a check
for (ret < 0) would have been the right choice. This commit fixes the check.

Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: simplify unlink reservations
Josef Bacik [Wed, 29 May 2013 18:54:47 +0000 (14:54 -0400)]
Btrfs: simplify unlink reservations

Dave pointed out a problem where if you filled up a file system as much as
possible you couldn't remove any files.  The whole unlink reservation thing is
convoluted because it tries to guess if it's going to add space to unlink
something or not, and has all these odd uncommented cases where it simply does
not try.  So to fix this I've added a way to conditionally steal from the global
reserve if we can't make our normal reservation.  If we have more than half the
space in the global reserve free we will go ahead and steal from the global
reserve.  With this patch Dave's reproducer now works and I can rm all the files
on the file system.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: merge pending IO for tree log write back
Miao Xie [Tue, 28 May 2013 10:05:39 +0000 (10:05 +0000)]
Btrfs: merge pending IO for tree log write back

Before applying this patch, we flushed the log tree of the fs/file
tree firstly, and then flushed the log root tree. It is ineffective,
especially on the hard disk. This patch improved this problem by wrapping
the above two flushes by the same blk_plug.

By test, the performance of the sync write went up ~60%(2.9MB/s -> 4.6MB/s)
on my scsi disk whose disk buffer was enabled.

Test step:
 # mkfs.btrfs -f -m single <disk>
 # mount <disk> <mnt>
 # dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: allow file data clone within a file
Liu Bo [Sun, 26 May 2013 13:50:31 +0000 (13:50 +0000)]
Btrfs: allow file data clone within a file

We did not allow file data clone within the same file because of
deadlock issues.

However, we now use nested lock to avoid deadlock between the
parent directory and the child file.

So it's safe to do file clone within the same file when the two
ranges are not overlapped.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove unused code in btrfs_del_root
Liu Bo [Sun, 26 May 2013 13:50:30 +0000 (13:50 +0000)]
Btrfs: remove unused code in btrfs_del_root

'leaf' and 'ri' is not used somehow.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: kill replicate code in replay_one_buffer
Liu Bo [Sun, 26 May 2013 13:50:29 +0000 (13:50 +0000)]
Btrfs: kill replicate code in replay_one_buffer

EXTREF is treated same as REF, so we can make the code tidy.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: update new flags for tracepoint
Liu Bo [Sun, 26 May 2013 13:50:28 +0000 (13:50 +0000)]
Btrfs: update new flags for tracepoint

Adding new flags to keep tracepoints consistent with btrfs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: check if leaf's parent exists before pushing items around
Liu Bo [Wed, 22 May 2013 12:07:06 +0000 (12:07 +0000)]
Btrfs: check if leaf's parent exists before pushing items around

During splitting a leaf, pushing items around to hopefully get some space only
works when we have a parent, ie. we have at least one sibling leaf.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: dont do log_removal in insert_new_root
Liu Bo [Wed, 22 May 2013 12:06:51 +0000 (12:06 +0000)]
Btrfs: dont do log_removal in insert_new_root

As for splitting a leaf, root is just the leaf, and tree mod log does not apply
on leaf, so in this case, we don't do log_removal.

As for splitting a node, the old root is kept as a normal node and we have nicely
put records in tree mod log for moving keys and items, so in this case we don't do
that either.

As above, insert_new_root can get rid of log_removal.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: return error code in btrfs_check_trunc_cache_free_space()
Wei Yongjun [Tue, 21 May 2013 02:39:21 +0000 (02:39 +0000)]
Btrfs: return error code in btrfs_check_trunc_cache_free_space()

Fix to return error code instead always return 0 from function
btrfs_check_trunc_cache_free_space().
Introduced by commit 7b61cd92242542944fc27024900c495a6a7b3396
(Btrfs: don't use global block reservation for inode cache truncation)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix estale with btrfs send
Josef Bacik [Mon, 20 May 2013 15:26:50 +0000 (11:26 -0400)]
Btrfs: fix estale with btrfs send

This fixes bugzilla 57491.  If we take a snapshot of a fs with a unlink ongoing
and then try to send that root we will run into problems.  When comparing with a
parent root we will search the parents and the send roots commit_root, which if
we've just created the snapshot will include the file that needs to be evicted
by the orphan cleanup.  So when we find a changed extent we will try and copy
that info into the send stream, but when we lookup the inode we use the normal
root, which no longer has the inode because the orphan cleanup deleted it.  The
best solution I have for this is to check our otransid with the generation of
the commit root and if they match just commit the transaction again, that way we
get the changes from the orphan cleanup.  With this patch the reproducer I made
for this bugzilla no longer returns ESTALE when trying to do the send.  Thanks,

Cc: stable@vger.kernel.org
Reported-by: Chris Wilson <jakdaw@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: device delete to get errors from the kernel
Anand Jain [Fri, 17 May 2013 10:52:45 +0000 (10:52 +0000)]
btrfs: device delete to get errors from the kernel

when user runs command btrfs dev del the raid requisite error if any
goes to the /var/log/messages, its not good idea to clutter messages
with these user (knowledge) errors, further user don't have to review
the system messages to know problem with the cli it should be dropped
to the user as part of the cli return.

to bring this feature created a set of the ERROR defined
BTRFS_ERROR_DEV* error codes and created their error string.

I expect this enum to be added with other error which we might
want to communicate to the user land

v3:
moved the code with in the file no logical change

v1->v2:
introduce error codes for the device mgmt usage

v1:
adds a parameter in the ioctl arg struct to carry the error string

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: do delay iput in sync_fs
Josef Bacik [Thu, 16 May 2013 15:14:33 +0000 (11:14 -0400)]
Btrfs: do delay iput in sync_fs

We get lock inversion with umount if we allow iputs from sync_fs, so use the
delay iput flag to keep this from happening.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: make the state of the transaction more readable
Miao Xie [Fri, 17 May 2013 03:53:43 +0000 (03:53 +0000)]
Btrfs: make the state of the transaction more readable

We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.

This patch improved the above problem. In this patch, we define 6 states
for the transaction,
  enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
  }
and just use 1 variant to track those state.

In order to make the blocked handle types for each state more clear,
we introduce a array:
  unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
   __TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
   __TRANS_START |
   __TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
   __TRANS_START |
   __TRANS_ATTACH |
   __TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
   __TRANS_START |
   __TRANS_ATTACH |
   __TRANS_JOIN |
   __TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
   __TRANS_START |
   __TRANS_ATTACH |
   __TRANS_JOIN |
   __TRANS_JOIN_NOLOCK),
  }
it is very intuitionistic.

Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove the time check in btrfs_commit_transaction()
Miao Xie [Wed, 15 May 2013 07:48:30 +0000 (07:48 +0000)]
Btrfs: remove the time check in btrfs_commit_transaction()

We checked the commit time to avoid committing the transaction
frequently, but it is unnecessary because:
- It made the transaction commit spend more time, and delayed the
  operation of the external writers(TRANS_START/TRANS_USERSPACE).
- Except the space that we have to commit transaction, such as
  snapshot creation, btrfs doesn't commit the transaction on its
  own initiative.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure
Miao Xie [Wed, 15 May 2013 07:48:29 +0000 (07:48 +0000)]
Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure

We used ->num_joined track if there were some writers which join the current
transaction when the committer was sleeping. If some writers joined the current
transaction, we has to continue the while loop to do some necessary stuff, such
as flush the ordered operations. But it is unnecessary because we will do it
after the while loop.

Besides that, tracking ->num_joined would make the committer drop into the while
loop when there are lots of internal writers(TRANS_JOIN).

So we remove ->num_joined and don't track if there are some writers which join
the current transaction when the committer is sleeping.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set
Miao Xie [Wed, 15 May 2013 07:48:28 +0000 (07:48 +0000)]
Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set

It is unnecessary to flush the delalloc inodes again and again because
we don't care the dirty pages which are introduced after the flush, and
they will be flush in the transaction commit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: don't wait for all the writers circularly during the transaction commit
Miao Xie [Wed, 15 May 2013 07:48:27 +0000 (07:48 +0000)]
Btrfs: don't wait for all the writers circularly during the transaction commit

btrfs_commit_transaction has the following loop before we commit the
transaction.

do {
    // attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
 (should_grow && cur_trans->num_joined != joined));

This is used to prevent from the TRANS_START to get in the way of a
committing transaction. But it does not prevent from TRANS_JOIN, that
is we would do this loop for a long time if some writers JOIN the
current transaction endlessly.

Because we need join the current transaction to do some useful stuff,
we can not block TRANS_JOIN here. So we introduce a external writer
counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
If the external writer counter is zero, we can break the above loop.

In order to make the code more clear, we don't use enum variant
to define the type of the transaction handle, use bitmask instead.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove the code for the impossible case in cleanup_transaction()
Miao Xie [Wed, 15 May 2013 07:48:26 +0000 (07:48 +0000)]
Btrfs: remove the code for the impossible case in cleanup_transaction()

If the transaction is removed from the transaction list, it means the
transaction has been committed successfully. So it is impossible to
call cleanup_transaction(), otherwise there is something wrong with
the code logic. Thus, we use BUG_ON() instead of the original handle.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup unnecessary assignment when cleaning up all the residual transaction
Miao Xie [Wed, 15 May 2013 07:48:25 +0000 (07:48 +0000)]
Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction

When we umount a fs with serious errors, we will invoke btrfs_cleanup_transactions()
to clean up the residual transaction. At this time, It is impossible to start a new
transaction, so we needn't assign trans_no_join to 1, and also needn't clear running
transaction every time we destroy a residual transaction.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: just flush the delalloc inodes in the source tree before snapshot creation
Miao Xie [Wed, 15 May 2013 07:48:24 +0000 (07:48 +0000)]
Btrfs: just flush the delalloc inodes in the source tree before snapshot creation

Before applying this patch, we need flush all the delalloc inodes in
the fs when we want to create a snapshot, it wastes time, and make
the transaction commit be blocked for a long time. It means some other
user operation would also be blocked for a long time.

This patch improves this problem, we just flush the delalloc inodes that
in the source trees before snapshot creation, so the transaction commit
will complete quickly.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: introduce per-subvolume ordered extent list
Miao Xie [Wed, 15 May 2013 07:48:23 +0000 (07:48 +0000)]
Btrfs: introduce per-subvolume ordered extent list

The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: introduce per-subvolume delalloc inode list
Miao Xie [Wed, 15 May 2013 07:48:22 +0000 (07:48 +0000)]
Btrfs: introduce per-subvolume delalloc inode list

When we create a snapshot, we need flush all delalloc inodes in the
fs, just flushing the inodes in the source tree is OK. So we introduce
per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: introduce grab/put functions for the root of the fs/file tree
Miao Xie [Wed, 15 May 2013 07:48:20 +0000 (07:48 +0000)]
Btrfs: introduce grab/put functions for the root of the fs/file tree

The grab/put funtions will be used in the next patch, which need grab
the root object and ensure it is not freed. We use reference counter
instead of the srcu lock is to aovid blocking the memory reclaim task,
which invokes synchronize_srcu().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup the similar code of the fs root read
Miao Xie [Wed, 15 May 2013 07:48:19 +0000 (07:48 +0000)]
Btrfs: cleanup the similar code of the fs root read

There are several functions whose code is similar, such as
  btrfs_find_last_root()
  btrfs_read_fs_root_no_radix()

Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
  btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.

So cleanup it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: make the snap/subv deletion end more early when the fs is R/O
Miao Xie [Tue, 14 May 2013 10:20:43 +0000 (10:20 +0000)]
Btrfs: make the snap/subv deletion end more early when the fs is R/O

The snapshot/subvolume deletion might spend lots of time, it would make
the remount task wait for a long time. This patch improve this problem,
we will break the deletion if the fs is remounted to be R/O. It will make
the users happy.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot()
Miao Xie [Tue, 14 May 2013 10:20:42 +0000 (10:20 +0000)]
Btrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot()

If the fs is remounted to be R/O, it is unnecessary to call
btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
this function. And besides that, it can make the check logic in the
caller more clear.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: make the cleaner complete early when the fs is going to be umounted
Miao Xie [Tue, 14 May 2013 10:20:41 +0000 (10:20 +0000)]
Btrfs: make the cleaner complete early when the fs is going to be umounted

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove unnecessary ->s_umount in cleaner_kthread()
Miao Xie [Tue, 14 May 2013 10:20:40 +0000 (10:20 +0000)]
Btrfs: remove unnecessary ->s_umount in cleaner_kthread()

In order to avoid the R/O remount, we acquired ->s_umount lock during
we deleted the dead snapshots and subvolumes. But it is unnecessary,
because we have cleaner_mutex.

We use cleaner_mutex to protect the process of the dead snapshots/subvolumes
deletion. And when we remount the fs to be R/O, we also acquire this mutex to
do cleanup after we change the status of the fs. That is this lock can serialize
the above operations, the cleaner can be aware of the status of the fs, and if
the cleaner is deleting the dead snapshots/subvolumes, the remount task will
wait for it. So it is safe to remove ->s_umount in cleaner_kthread().

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup: don't check the same thing twice
Stefan Behrens [Mon, 13 May 2013 13:53:35 +0000 (13:53 +0000)]
Btrfs: cleanup: don't check the same thing twice

btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in six places.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup, btrfs_read_fs_root_no_name() doesn't return NULL
Stefan Behrens [Mon, 13 May 2013 14:42:57 +0000 (14:42 +0000)]
Btrfs: cleanup, btrfs_read_fs_root_no_name() doesn't return NULL

No need to check for NULL in send.c and disk-io.c.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: delete unused function
Stefan Behrens [Mon, 13 May 2013 13:53:33 +0000 (13:53 +0000)]
Btrfs: delete unused function

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove useless copy in quota_ctl
Liu Bo [Mon, 13 May 2013 11:10:44 +0000 (11:10 +0000)]
Btrfs: remove useless copy in quota_ctl

We don't need to copy it back to user side as it remains unchanged.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoMinor format cleanup.
Andreas Philipp [Sat, 11 May 2013 11:12:54 +0000 (11:12 +0000)]
Minor format cleanup.

Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and
BTRFS_BLOCK_GROUP_RAID6.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup unused arguments in send.c
Tsutomu Itoh [Wed, 8 May 2013 07:51:52 +0000 (07:51 +0000)]
Btrfs: cleanup unused arguments in send.c

sctx is removed from the argument of the function that
doesn't use sctx.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix a comment
Stefan Behrens [Tue, 7 May 2013 10:23:30 +0000 (10:23 +0000)]
Btrfs: fix a comment

The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: add ioctl to wait for qgroup rescan completion
Jan Schmidt [Mon, 6 May 2013 19:14:17 +0000 (19:14 +0000)]
Btrfs: add ioctl to wait for qgroup rescan completion

btrfs_qgroup_wait_for_completion waits until the currently running qgroup
operation completes. It returns immediately when no rescan process is in
progress. This is useful to automate things around the rescan process (e.g.
testing).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist
Wang Shilong [Mon, 6 May 2013 11:03:27 +0000 (11:03 +0000)]
Btrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist

When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
when we want to walk qgroup tree.

By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
once. This reduce some sys time to allocate memory, see the measurements below

fsstress -p 4 -n 10000 -d $dir

With this patch:

real    0m50.153s
user    0m0.081s
sys     0m6.294s

real    0m51.113s
user    0m0.092s
sys     0m6.220s

real    0m52.610s
user    0m0.096s
sys     0m6.125s avg 6.213
-----------------------------------------------------
Without the patch:

real    0m54.825s
user    0m0.061s
sys     0m10.665s

real    1m6.401s
user    0m0.089s
sys     0m11.218s

real    1m13.768s
user    0m0.087s
sys     0m10.665s       avg 10.849

we can see the sys time reduce ~43%.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: show compiled-in config features at module load time
David Sterba [Tue, 30 Apr 2013 16:51:59 +0000 (16:51 +0000)]
btrfs: show compiled-in config features at module load time

We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
Also kill version.h file that serves no purpose, we don't use any
version tag for kernel module.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: move ifdef around sanity checks out of init_btrfs_fs
David Sterba [Tue, 30 Apr 2013 16:51:58 +0000 (16:51 +0000)]
btrfs: move ifdef around sanity checks out of init_btrfs_fs

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: add prefix to sanity tests messages
David Sterba [Tue, 30 Apr 2013 16:51:57 +0000 (16:51 +0000)]
btrfs: add prefix to sanity tests messages

And change the message level to KERN_INFO.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: add debug check for extent_io range alignment
David Sterba [Tue, 30 Apr 2013 15:22:23 +0000 (15:22 +0000)]
btrfs: add debug check for extent_io range alignment

The 'end' value must exactly cover the end of the interval, which means
one byte less than the expected block alignment, or in case of a file
smaller than one block, one byte less than the inode size.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix check on same raid type flag twice
Henrik Nordvik [Mon, 29 Apr 2013 18:09:23 +0000 (18:09 +0000)]
Btrfs: fix check on same raid type flag twice

Code checked for raid 5 flag in two else-if branches, so code would never be reached. Probably a copy-paste bug.

Signed-off-by: Henrik Nordvik <henrikno@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: stop all workers before cleaning up roots
Josef Bacik [Thu, 30 May 2013 20:55:44 +0000 (16:55 -0400)]
Btrfs: stop all workers before cleaning up roots

Dave reported a panic because the extent_root->commit_root was NULL in the
caching kthread.  That is because we just unset it in free_root_pointers, which
is not the correct thing to do, we have to either wait for the caching kthread
to complete or hold the extent_commit_sem lock so we know the thread has exited.
This patch makes the kthreads all stop first and then we do our cleanup.  This
should fix the race.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix use-after-free bug during umount
Liu Bo [Sun, 26 May 2013 13:50:27 +0000 (13:50 +0000)]
Btrfs: fix use-after-free bug during umount

Commit be283b2e674a09457d4563729015adb637ce7cc1
(    Btrfs: use helper to cleanup tree roots) introduced the following bug,

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
 IP: [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
[...]
 Pid: 2463, comm: btrfs-cache-1 Tainted: G           O 3.9.0+ #4 innotek GmbH VirtualBox/VirtualBox
 RIP: 0010:[<ffffffffa039368c>]  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
 Process btrfs-cache-1 (pid: 2463, threadinfo ffff880112d60000, task ffff880117679730)
[...]
 Call Trace:
  [<ffffffffa0398a99>] btrfs_search_slot+0x104/0x64d [btrfs]
  [<ffffffffa039aea4>] btrfs_next_old_leaf+0xa7/0x334 [btrfs]
  [<ffffffffa039b141>] btrfs_next_leaf+0x10/0x12 [btrfs]
  [<ffffffffa039ea13>] caching_thread+0x1a3/0x2e0 [btrfs]
  [<ffffffffa03d8811>] worker_loop+0x14b/0x48e [btrfs]
  [<ffffffffa03d86c6>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
  [<ffffffff81068d3d>] kthread+0x8d/0x95
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
  [<ffffffff8151e5ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
RIP  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]

We've free'ed commit_root before actually getting to free block groups where
caching thread needs valid extent_root->commit_root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
10 years agoBtrfs: init relocate extent_io_tree with a mapping
Josef Bacik [Fri, 31 May 2013 17:04:36 +0000 (13:04 -0400)]
Btrfs: init relocate extent_io_tree with a mapping

Dave reported a NULL pointer deref.  This is caused because he thought he'd be
smart and add sanity checks to the extent_io bit operations, but he didn't
expect a tree to have a NULL mapping.  To fix this we just need to init the
relocation's processed_blocks with the btree_inode->i_mapping.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
10 years agobtrfs: Drop inode if inode root is NULL
Naohiro Aota [Thu, 6 Jun 2013 09:56:34 +0000 (09:56 +0000)]
btrfs: Drop inode if inode root is NULL

There is a path where btrfs_drop_inode() is called with its inode's root
is NULL: In btrfs_new_inode(), when btrfs_set_inode_index() fails,
iput() is called. We should handle this case before taking look at the
root->root_item.

Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
10 years agoBtrfs: don't delete fs_roots until after we cleanup the transaction
Josef Bacik [Thu, 6 Jun 2013 14:29:40 +0000 (10:29 -0400)]
Btrfs: don't delete fs_roots until after we cleanup the transaction

We get a use after free if we had a transaction to cleanup since there could be
delayed inodes which refer to their respective fs_root.  Thanks

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
11 years agoMerge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs...
Chris Mason [Sat, 18 May 2013 01:53:17 +0000 (21:53 -0400)]
Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next

11 years agoBtrfs: use a btrfs bioset instead of abusing bio internals
Chris Mason [Fri, 17 May 2013 22:30:14 +0000 (18:30 -0400)]
Btrfs: use a btrfs bioset instead of abusing bio internals

Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.

As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs.  They are also used
to count IO failures on a per device basis.

Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.

This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index.  The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
11 years agoBtrfs: make sure roots are assigned before freeing their nodes
Josef Bacik [Fri, 17 May 2013 18:06:51 +0000 (14:06 -0400)]
Btrfs: make sure roots are assigned before freeing their nodes

If we fail to load the chunk tree we'll call free_root_pointers, except we may
not have assigned the roots for the dev_root/extent_root/csum_root yet, so we
could NULL pointer deref at this point.  Just add checks to make sure these
roots are set to keep us from panicing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: explicitly use global_block_rsv for quota_tree
Stefan Behrens [Thu, 16 May 2013 14:48:19 +0000 (14:48 +0000)]
Btrfs: explicitly use global_block_rsv for quota_tree

The quota_tree was set up to use the empty_block_rsv before
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items) are written. In fact, use_block_rsv() which is used
in btrfs_cow_block() falls back to the global_block_rsv in
this case. But just in order to make it more clear what is
happening, change it to explicitly use the global_block_rsv.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: do away with non-whole_page extent I/O
Alexandre Oliva [Wed, 15 May 2013 15:38:55 +0000 (11:38 -0400)]
btrfs: do away with non-whole_page extent I/O

end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error.  This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.

It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case.  Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!

I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes.  The
warnings should probably just go away some time.

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
Miao Xie [Wed, 15 May 2013 07:48:21 +0000 (07:48 +0000)]
Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context

btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
Miao Xie [Wed, 15 May 2013 07:48:18 +0000 (07:48 +0000)]
Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()

We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: pause the space balance when remounting to R/O
Miao Xie [Wed, 15 May 2013 07:48:17 +0000 (07:48 +0000)]
Btrfs: pause the space balance when remounting to R/O

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix unprotected root node of the subvolume's inode rb-tree
Miao Xie [Wed, 15 May 2013 07:48:16 +0000 (07:48 +0000)]
Btrfs: fix unprotected root node of the subvolume's inode rb-tree

The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix accessing a freed tree root
Miao Xie [Wed, 15 May 2013 07:48:15 +0000 (07:48 +0000)]
Btrfs: fix accessing a freed tree root

inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: return errno if possible when we fail to allocate memory
Liu Bo [Tue, 14 May 2013 02:12:15 +0000 (02:12 +0000)]
Btrfs: return errno if possible when we fail to allocate memory

We need to set return value explicitly, otherwise we'll lose the error
value.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: update the global reserve if it is empty
Miao Xie [Mon, 13 May 2013 13:55:12 +0000 (13:55 +0000)]
Btrfs: update the global reserve if it is empty

Before applying this patch, we reserved the space for the global reserve
by the minimum unit if we found it is empty, it was unreasonable and
inefficient, because if the global reserve space was depleted, it implied
that the size of the global reserve was too small. In this case, we shoud
update the global reserve and fill it.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't steal the reserved space from the global reserve if their space type...
Miao Xie [Mon, 13 May 2013 13:55:11 +0000 (13:55 +0000)]
Btrfs: don't steal the reserved space from the global reserve if their space type is different

If the type of the space we need is different with the global reserve, we
can not steal the space from the global reserve, because we can not allocate
the space from the free space cache that the global reserve points to.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: optimize the error handle of use_block_rsv()
Miao Xie [Mon, 13 May 2013 13:55:10 +0000 (13:55 +0000)]
Btrfs: optimize the error handle of use_block_rsv()

cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't use global block reservation for inode cache truncation
Miao Xie [Mon, 13 May 2013 13:55:09 +0000 (13:55 +0000)]
Btrfs: don't use global block reservation for inode cache truncation

It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't abort the current transaction if there is no enough space for inode...
Miao Xie [Mon, 13 May 2013 13:55:08 +0000 (13:55 +0000)]
Btrfs: don't abort the current transaction if there is no enough space for inode cache

The filesystem with inode cache was forced to be read-only when we umounted it.

Steps to reproduce:
 # mkfs.btrfs -f ${DEV}
 # mount -o inode_cache ${DEV} ${MNT}
 # dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192
 # btrfs fi syn ${MNT}
 # dd if=${MNT}/file1 of=/dev/null bs=1M
 # rm -f ${MNT}/file1
 # btrfs fi syn ${MNT}
 # umount ${MNT}

It is because there was no enough space to do inode cache truncation, and then
we aborted the current transaction.

But no space error is not a serious problem when we write out the inode cache,
and it is safe that we just skip this step if we meet this problem. So we need
not abort the current transaction.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoCorrect allowed raid levels on balance.
Andreas Philipp [Sat, 11 May 2013 11:13:03 +0000 (11:13 +0000)]
Correct allowed raid levels on balance.

Raid5 with 3 devices is well defined while the old logic allowed
raid5 only with a minimum of 4 devices when converting the block group
profile via btrfs balance. Creating a raid5 with just three devices
using mkfs.btrfs worked always as expected. This is now fixed and the
whole logic is rewritten.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix possible memory leak in replace_path()
Stefan Behrens [Wed, 8 May 2013 08:56:09 +0000 (08:56 +0000)]
Btrfs: fix possible memory leak in replace_path()

In replace_path(), if read_tree_block() fails, we cannot return
directly, we should free some allocated memory otherwise memory
leak happens.

Similar to Wang's "Btrfs: fix possible memory leak in the
find_parent_nodes()" patch, the current commit fixes an issue that
is related to the "Btrfs: fix all callers of read_tree_block"
commit.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix possible memory leak in the find_parent_nodes()
Wang Shilong [Wed, 8 May 2013 08:10:25 +0000 (08:10 +0000)]
Btrfs: fix possible memory leak in the find_parent_nodes()

In the find_parent_nodes(), if read_tree_block() fails, we can
not return directly, we should free some allocated memory otherwise
memory leak happens.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't allow device replace on RAID5/RAID6
Stefan Behrens [Tue, 7 May 2013 17:28:03 +0000 (17:28 +0000)]
Btrfs: don't allow device replace on RAID5/RAID6

This is not yet supported and causes crashes. One sad user reported
that it destroyed his filesystem.

One failure is in __btrfs_map_block+0xc1f calling kmalloc(0).

0x5f21f is in __btrfs_map_block (fs/btrfs/volumes.c:4923).
4918                            num_stripes = map->num_stripes;
4919                            max_errors = nr_parity_stripes(map);
4920
4921                            raid_map = kmalloc(sizeof(u64) * num_stripes,
4922                                               GFP_NOFS);
4923                            if (!raid_map) {
4924                                    ret = -ENOMEM;
4925                                    goto out;
4926                            }
4927

There might be more issues. Until this is really tested, don't allow
users to start the procedure on RAID5/RAID6 filesystems.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: handle running extent ops with skinny metadata
Josef Bacik [Thu, 9 May 2013 17:49:30 +0000 (13:49 -0400)]
Btrfs: handle running extent ops with skinny metadata

Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata.  We also lose
the level of the metadata we are working on.  So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: remove warn on in free space cache writeout
Josef Bacik [Wed, 8 May 2013 20:44:57 +0000 (16:44 -0400)]
Btrfs: remove warn on in free space cache writeout

This catches block groups that are too large to properly cache.  We deal with
this case fine, so the warning just confuses users.  Remove the warning.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't null pointer deref on abort
Josef Bacik [Wed, 8 May 2013 17:30:11 +0000 (13:30 -0400)]
Btrfs: don't null pointer deref on abort

I'm sorry, theres no excuse for this sort of work.  We need to use
root->leafsize since eb may be NULL.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: don't stop searching after encountering the wrong item
Gabriel de Perthuis [Mon, 6 May 2013 17:40:18 +0000 (17:40 +0000)]
btrfs: don't stop searching after encountering the wrong item

The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occuring before any search result is
found would trigger an overflow and stop the search entirely.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Cc: stable@vger.kernel.org
Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix off-by-one in fiemap
Liu Bo [Wed, 1 May 2013 16:23:41 +0000 (16:23 +0000)]
Btrfs: fix off-by-one in fiemap

lock_extent/unlock_extent expect an exclusive end.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: annotate quota tree for lockdep
David Sterba [Tue, 30 Apr 2013 17:29:29 +0000 (17:29 +0000)]
btrfs: annotate quota tree for lockdep

Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.

There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID.  No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: allow superblock mismatch from older mkfs
Chris Mason [Tue, 7 May 2013 15:00:13 +0000 (11:00 -0400)]
Btrfs: allow superblock mismatch from older mkfs

We've added new checks to make sure the super block crc is correct
during mount.  A fresh filesystem from an older mkfs won't have the
crc set.  This adds a warning when it finds a newly created filesystem
but doesn't fail the mount.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
11 years agobtrfs: enhance superblock checks
David Sterba [Wed, 6 Mar 2013 14:57:46 +0000 (15:57 +0100)]
btrfs: enhance superblock checks

The superblock checksum is not verified upon mount. <awkward silence>

Add that check and also reorder existing checks to a more logical
order.

Current mkfs.btrfs does not calculate the correct checksum of
super_block and thus a freshly created filesytem will fail to mount when
this patch is applied.

First transaction commit calculates correct superblock checksum and
saves it to disk.

Reproducer:
$ mfks.btrfs /dev/sda
$ mount /dev/sda /mnt
$ btrfs scrub start /mnt
$ sleep 5
$ btrfs scrub status /mnt
... super:2 ...

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
11 years agobtrfs: fix misleading variable name for flags
David Sterba [Mon, 29 Apr 2013 13:39:40 +0000 (13:39 +0000)]
btrfs: fix misleading variable name for flags

The variable was named 'data' in btrfs_reserve_extent and that's the
only function that actually uses it to let btrfs_get_alloc_profile know
what profile we want. Then it's passed down as u64 flags.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: use unsigned long type for extent state bits
David Sterba [Mon, 29 Apr 2013 13:38:46 +0000 (13:38 +0000)]
btrfs: use unsigned long type for extent state bits

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>